Systems and methods for mental health assessment

ABSTRACT

The present disclosure provides systems and methods for assessing a mental state of a subject in a single session or over multiple different sessions, using, for example, an automated module to present and/or formulate at least one query based in part on one or more target mental states to be assessed. The query may be configured to elicit at least one response from the subject. The query may be transmitted in an audio, visual, and/or textual format to the subject to elicit the response. Data comprising the response from the subject can be received. The data can be processed using one or more individual, joint, or fused models. One or more assessments of the mental state associated with the subject can be generated for the single session, for each of the multiple different sessions, or upon completion of one or more sessions of the multiple different sessions.

CROSS-REFERENCE

This application is a continuation application of U.S. patent application Ser. No. 17/129,859, filed Dec. 21, 2020, which is a continuation of U.S. patent application Ser. No. 16/918,624, filed Jul. 1, 2020, which is a continuation of U.S. patent application Ser. No. 16/560,720, filed Sep. 4, 2019, now U.S. Pat. No. 10,748,644, which is a continuation of U.S. application Ser. No. 16/523,298, filed on Jul. 26, 2019, which is a continuation of U.S. International Application No. PCT/US2019/037953, filed on Jun. 19, 2019, which claims priority to U.S. Provisional Application No. 62/687,176, filed Jun. 19, 2018, U.S. Provisional Application No. 62/749,113, filed Oct. 22, 2018, U.S. Provisional Application No. 62/749,654, filed Oct. 23, 2018, U.S. Provisional Application No. 62/749,663, filed Oct. 23, 2018, U.S. Provisional Application No. 62/749,669, filed Oct. 23, 2018, U.S. Provisional Application No. 62/749,672, filed Oct. 24, 2018, U.S. Provisional Application No. 62/754,534, filed Nov. 1, 2018, U.S. Provisional Application No. 62/754,541, filed Nov. 1, 2018, U.S. Provisional Application No. 62/754,547, filed Nov. 1, 2018, U.S. Provisional Application No. 62/755,356, filed Nov. 2, 2018, U.S. Provisional Application No. 62/755,361, filed Nov. 2, 2018, U.S. Provisional Application No. 62/733,568, filed Sep. 19, 2018, and U.S. Provisional Application No. 62/733,552, filed Sep. 19, 2018, each of which is incorporated herein by reference in its entirety for all purposes.

BACKGROUND

Behavioral health is a serious problem. In the United States, suicide ranks in the top 10 causes of death as reported by the Centers for Disease Control and Prevention (CDC). Depression is the leading cause of disability worldwide, according to the World Health Organization (WHO). Screening for depression and other mental health disorders by doctors and health service providers is widely recommended. The current “gold standard” for screening or monitoring for depression in patients is the PHQ-9 (Patient Health Questionnaire 9), a written depression screening or monitoring test with nine (9) multiple-choice questions. Other similar assessment tests include the PHQ-2 and the Generalized Anxiety Disorder 7 (GAD-7).

Many believe the PHQ-9 and other, similar screening or monitoring tools for detecting behavioral health diagnoses such as depression are inadequate. While the PHQ-9 is purported to successfully detect depression in 85-95% of patients, it is also purported that 54% of all suicides are committed by people with no diagnosis of depression. These two assertions appear entirely inconsistent with each other, but the reality is that not enough people are being screened.

Part of the problem is that traditional screening or monitoring surveys are not engaging, owing to their repetitive nature and lack of personalization. Another problem is that patients can be dishonest in their responses to the assessment tool, and the PHQ-9 and similar tools provide no mechanism by which dishonesty in the patient's responses can be assessed. Finally, these surveys take effort on the part of the clinician and the patient, as some patients need assistance to complete them, and this disrupts both the clinician and patient workflows.

SUMMARY

The present disclosure provides systems and methods that can more accurately and effectively assess, screen, estimate, and/or monitor the mental state of human subjects, when compared to conventional mental health assessment tools. In one aspect, a method for assessing a mental state of a subject in a single session or over multiple different sessions is provided. The method can comprise using an automated module to present and/or formulate at least one query based in part on one or more target mental states to be assessed. The at least one query can be configured to elicit at least one response from the subject. The method may also comprise transmitting the at least one query in an audio, visual, and/or textual format to the subject to elicit the at least one response. The method may also comprise receiving data comprising the at least one response from the subject in response to transmitting the at least one query. The data can comprise speech data. The method may further comprise processing the data using one or more individual, joint, or fused models comprising a natural language processing (NLP) model, an acoustic model, and/or a visual model. The method may further comprise generating, for the single session, for each of the multiple different sessions, or upon completion of one or more sessions of the multiple different sessions, one or more assessments of the mental state associated with the subject.

In some embodiments, the one or more individual, joint, or fused models may comprise a metadata model. The metadata model can be configured to use demographic information and/or a medical history of the subject to generate the one or more assessments of the mental state associated with the subject.

In some embodiments, the at least one query can comprise a plurality of queries and the at least one response can comprise a plurality of responses. The plurality of queries can be transmitted in a sequential manner to the subject and configured to systematically elicit the plurality of responses from the subject. In some embodiments, the plurality of queries can be structured in a hierarchical manner such that each subsequent query of the plurality of queries is structured as a logical follow-on to the subject's response to a preceding query, and can be designed to assess or draw inferences on a plurality of aspects of the mental state of the subject.
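By way of illustration only, the hierarchical query structure described above might be sketched in Python as follows. The node layout, field names, and response labels are hypothetical and are not part of the disclosed embodiments; a real embodiment would draw queries from the question bank described elsewhere herein.

    # Illustrative sketch of a hierarchical query bank (hypothetical structure).
    # Each query node names follow-on queries keyed by a coarse label assigned
    # to the subject's preceding response.
    from dataclasses import dataclass, field
    from typing import Dict, Optional

    @dataclass
    class QueryNode:
        text: str                      # query transmitted to the subject
        aspect: str                    # aspect of the mental state being probed
        follow_ons: Dict[str, "QueryNode"] = field(default_factory=dict)

        def next_query(self, response_label: str) -> Optional["QueryNode"]:
            """Return the logical follow-on for a label of the preceding response."""
            return self.follow_ons.get(response_label)

    root = QueryNode("How have you been sleeping lately?", aspect="sleep")
    root.follow_ons["poorly"] = QueryNode(
        "What do you find yourself thinking about when you cannot sleep?",
        aspect="rumination")
    root.follow_ons["well"] = QueryNode(
        "How is your energy during the day?", aspect="fatigue")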

In some embodiments, the automated module can be further configured to present and/or formulate the at least one query based in part on a profile of the subject.

In some embodiments, the one or more target mental states can be selected from the group consisting of depression, anxiety, post-traumatic stress disorder (PTSD), schizophrenia, suicidality, and bipolar disorder.

In some embodiments, the one or more target mental states can comprise one or more conditions or disorders associated or comorbid with a list of predefined mental disorders. The list of predefined mental disorders may include mental disorders as defined or provided in the Diagnostic and Statistical Manual of Mental Disorders. In some embodiments, the one or more associated or comorbid conditions or disorders can comprise fatigue, loneliness, low motivation, or stress.

In some embodiments, the assessment can comprise a score that indicates whether the subject is (i) more likely than others to experience at least one of the target mental states or (ii) more likely than others to experience at least one of the target mental states at a future point in time. In some embodiments, the future point in time can be within a clinically actionable future.

In some embodiments, the method can further comprise: transmitting the assessment to a healthcare provider to be used in evaluating the mental state of the subject. The transmitting can be performed in real time during the assessment, just-in-time, or after the assessment has been completed.

In some embodiments, the plurality of queries can be designed to test for or detect a plurality of aspects of the mental state of the subject.

In some embodiments, the assessment can comprise a score that indicates whether the subject is (i) more likely than others to experience at least one of the target mental states or (ii) more likely than others to experience at least one of the target mental states at a future point in time. The score can be calculated based on processed data obtained from the subject's plurality of responses to the plurality of queries. In some embodiments, the score can be continuously updated with processed data obtained from each of the subject's follow-on responses to a preceding query.
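A minimal sketch of such continuous updating is given below, assuming a hypothetical per-response model score and a simple running mean; the disclosure does not prescribe this particular update rule.

    # Hypothetical running update of a session score: after each follow-on
    # response is processed, fold its model output into the session estimate.
    class RunningScore:
        def __init__(self) -> None:
            self.total = 0.0
            self.count = 0

        def update(self, response_score: float) -> float:
            """Incorporate the score for one processed response; return the estimate."""
            self.total += response_score
            self.count += 1
            return self.total / self.count

    session = RunningScore()
    for response_score in (0.42, 0.55, 0.61):   # scores from successive responses
        estimate = session.update(response_score)   # running mean after each response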

In some embodiments, the method can further comprise, based on the at least one response, identifying additional information to be elicited from the subject. The method can further comprise transmitting a subsequent query to the subject. The subsequent query relates to the additional information and can be configured to elicit a subsequent response from the subject. The method can further comprise receiving data comprising the subsequent response from the subject in response to transmitting the subsequent query. The method can further comprise processing the subsequent response to update the assessment of the mental state of the subject. In some embodiments, identifying additional information to be elicited from the subject can comprise: identifying (i) one or more elements of substantive content or (ii) one or more patterns in the data that are material to the mental state of the subject. The method can further comprise, for each of the one or more elements of substantive content or the one or more patterns: identifying one or more items of follow-up information that are related to the one or more elements or the one or more patterns to be asked of the subject, and generating a subsequent query. The subsequent query can relate to the one or more items of follow-up information.

In some embodiments, the NLP model can be selected from the group consisting of a sentiment model, a statistical language model, a topic model, a syntactic model, an embedding model, a dialog or discourse model, an emotion or affect model, and a speaker personality model.

In some embodiments, the data can further comprise images or video of the subject. The data can be further processed using the visual model to generate the assessment of the mental state of the subject. In some embodiments, the visual model can be selected from the group consisting of a facial cue model, a body movement/motion model, and an eye activity model.

In some embodiments, the at least one query can be transmitted in a conversational context in the form of a question, statement, or comment that is configured to elicit the at least one response from the subject. In some embodiments, the conversational context can be designed to promote elicitation of truthful, reflective, thoughtful, or candid responses from the subject. In some embodiments, the conversational context can be designed to affect an amount of time that the subject takes to compose the at least one response. In some embodiments, the method can further comprise: transmitting one or more prompts in the audio and/or visual format to the subject when a time latency threshold is exceeded. In some embodiments, the conversational context can be designed to enhance one or more performance metrics of the assessment of the mental state of the subject. In some embodiments, the one or more performance metrics can be selected from the group consisting of an F1 score, an area under the curve (AUC), a sensitivity, a specificity, a positive predictive value (PPV), and an equal error rate.

In some embodiments, the at least one query is not or need not be transmitted or provided in the format of a standardized test or questionnaire. In some embodiments, the at least one query can comprise subject matter that has been adapted or modified from a standardized test or questionnaire. In some embodiments, the standardized test or questionnaire can be selected from the group consisting of PHQ-9, GAD-7, HAM-D, and BDI, or can be another similar test or questionnaire for assessing a patient's mental health state.

In some embodiments, the one or more individual, joint, or fused models can comprise a regression model.

In some embodiments, the at least one query can be designed to be open-ended without limiting the at least one response from the subject to be a binary yes-or-no response.

In some embodiments, the score can be used to calculate one or more scores with a clinical value.

In some embodiments, the assessment can comprise a quantized score estimate of the mental state of the subject. In some embodiments, the quantized score estimate can comprise a calibrated score estimate. In some embodiments, the quantized score estimate can comprise a binary score estimate.
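As an informal illustration, a continuous estimate might be quantized to a binary or few-level score as sketched below; the cut points are arbitrary placeholders, not values taught by the disclosure.

    def quantize(score: float, binary: bool = False) -> int:
        """Map a continuous score in [0, 1] to a quantized estimate.
        The cut points below are illustrative placeholders only."""
        if binary:
            return int(score >= 0.5)        # binary score estimate
        cut_points = (0.25, 0.5, 0.75)      # placeholder quantization levels
        return sum(score >= c for c in cut_points)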

In some embodiments, the plurality of queries can be represented as a series of edges and the plurality of responses can be represented as a series of nodes in a nodal network.
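One hypothetical encoding of that nodal network is sketched below, with responses stored as nodes and the queries that elicited them stored as labeled edges between successive responses; the structure and names are illustrative assumptions.

    # Sketch: responses as nodes, queries as directed edges between successive
    # responses, forming a simple nodal network of the session.
    from typing import List, Tuple

    responses: List[str] = []                      # node i: the i-th response
    query_edges: List[Tuple[int, int, str]] = []   # (from_node, to_node, query)

    def record(query_text: str, response_text: str) -> None:
        responses.append(response_text)
        if len(responses) > 1:
            query_edges.append((len(responses) - 2, len(responses) - 1, query_text))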

In some embodiments, the mental state can comprise one or more medical, psychological, or psychiatric conditions or symptoms.

In some embodiments, the method can be configured to further assess a physical state of the subject as manifested based on the speech data of the subject. The method can further comprise: processing the data using the one or more individual, joint, or fused models to generate an assessment of the physical state of the subject. The assessment of the physical state can comprise a score that indicates whether the subject is (i) more likely than others to experience at least one of a plurality of physiological conditions or (ii) more likely than others to experience at least one of the physiological conditions at a future point in time.

In some embodiments, the physical state of the subject is manifested due to one or more physical conditions that affect a characteristic or a quality of voice of the subject.

In some embodiments, the automated module can be a mental health screening module that can be configured to dynamically formulate the at least one query based in part on the one or more target mental states to be assessed.

In some embodiments, the one or more individual, joint, or fused models can comprise a composite model that can be an aggregate of two or more different models.

Another aspect of the present disclosure provides a non-transitory computer-readable medium comprising machine-executable instructions that, upon execution by one or more computer processors, implement any of the foregoing methods described above or elsewhere herein.

Another aspect of the present disclosure provides a system comprising one or more computer processors and memory comprising machine-executable instructions that, upon execution by the one or more computer processors, implement any of the foregoing methods described above or elsewhere herein.

Another aspect of the present disclosure provides a method for screening or monitoring a subject for, or diagnosing the subject with, a mental health disorder. The method can comprise: transmitting at least one query to the subject. The at least one query can be configured to elicit at least one response from the subject. The method can further comprise receiving data comprising the at least one response from the subject in response to transmitting the at least one query. The data can comprise speech data. The method can further comprise processing the data using one or more individual, joint, or fused models comprising a natural language processing (NLP) model, an acoustic model, and/or a visual model to generate an output. The method can further comprise using at least the output to generate a score and a confidence level of the score. The score can comprise an estimate that the subject has the mental health disorder. The confidence level can be based at least in part on a quality of the speech data and can represent a degree to which the estimate can be trusted.

In some embodiments, the one or more individual, joint, or fused models can comprise a metadata model. The metadata model can be configured to use demographic information and/or a medical history of the subject to generate the one or more assessments of the mental state associated with the subject.

In some embodiments, the output can comprise an NLP output, an acoustic output, and a visual output. In some embodiments, the NLP output, the acoustic output, and the visual output can each comprise a plurality of outputs corresponding to different time ranges of the data. In some embodiments, generating the score can comprise: (i) segmenting the NLP output, the acoustic output, and the visual output into discrete time segments, (ii) assigning a weight to each discrete time segment, and (iii) computing a weighted average of the NLP output, the acoustic output, and the visual output using the assigned weights. In some embodiments, the weights can be based at least on (i) base weights of the one or more individual, joint, or fused models and (ii) a confidence level of each discrete time segment of the NLP output, the acoustic output, and the visual output.
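The weighted averaging described above can be sketched as follows. The scores, confidences, and base weights are invented values, and the combination rule (effective weight equals base weight multiplied by segment confidence) is one plausible reading of the passage rather than the disclosed method.

    # Illustrative fusion over discrete time segments: each model contributes a
    # per-segment output and a per-segment confidence; its effective weight in a
    # segment is base_weight * confidence.
    from typing import Dict, List

    def fuse_segments(outputs: Dict[str, List[float]],
                      confidences: Dict[str, List[float]],
                      base_weights: Dict[str, float]) -> float:
        num_segments = len(next(iter(outputs.values())))
        weighted_sum = total_weight = 0.0
        for seg in range(num_segments):
            for model, scores in outputs.items():
                w = base_weights[model] * confidences[model][seg]
                weighted_sum += w * scores[seg]
                total_weight += w
        return weighted_sum / total_weight

    score = fuse_segments(
        outputs={"nlp": [0.7, 0.6], "acoustic": [0.5, 0.4], "visual": [0.6, 0.8]},
        confidences={"nlp": [0.9, 0.8], "acoustic": [0.6, 0.7], "visual": [0.5, 0.9]},
        base_weights={"nlp": 1.0, "acoustic": 0.8, "visual": 0.6})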

In some embodiments, the one or more individual, joint, or fused models can be interdependent such that each of the one or more individual, joint, or fused models is conditioned on an output of at least one other of the one or more individual, joint, or fused models.

In some embodiments, generating the score can comprise fusing the NLP output, the acoustic output, and the visual output.

In some embodiments, generating the confidence level of the score can comprise fusing (i) a confidence level of the NLP output with (ii) a confidence level of the acoustic output.

In some embodiments, the method can further comprise converting the score into one or more scores with a clinical value.

In some embodiments, the method can further comprise transmitting the one or more scores with a clinical value to the subject and/or a contact for the subject. In some embodiments, the method can further comprise transmitting the one or more scores with a clinical value to a healthcare provider for use in evaluating and/or providing care for the mental health of the subject. In some embodiments, the transmitting can comprise transmitting the one or more scores with a clinical value to the healthcare provider during the screening, monitoring, or diagnosing. In some embodiments, the transmitting can comprise transmitting the one or more scores with a clinical value to the healthcare provider or a payer after the screening, monitoring, or diagnosing has been completed.

In some embodiments, the at least one query can comprise a plurality of queries and the at least one response can comprise a plurality of responses. Generating the score can comprise updating the score after receiving each of the plurality of responses, and the method can further comprise: converting the score to one or more scores with a clinical value after each of the updates. The method can further comprise transmitting the one or more scores with a clinical value to a healthcare provider after the converting.

In some embodiments, the method can further comprise: determining that the confidence level does not satisfy a predetermined criterion; in real time and based at least in part on the at least one response, generating at least one additional query; and, using the at least one additional query, repeating steps (a)-(d) until the confidence level satisfies the predetermined criterion.
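A schematic control loop for that repeat-until-confident behavior follows; the ask and process callables stand in for steps (a)-(d), and the threshold is a placeholder.

    # Hypothetical loop: keep generating additional queries in real time until
    # the confidence level of the score meets a predetermined criterion.
    CONFIDENCE_CRITERION = 0.8   # placeholder threshold

    def assess(ask, process):
        """ask(query) returns a response; process(response) returns (score, confidence)."""
        response = ask("initial query")
        score, confidence = process(response)
        while confidence < CONFIDENCE_CRITERION:
            follow_up = f"follow-up prompted by: {response!r}"
            response = ask(follow_up)
            score, confidence = process(response)
        return score, confidence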

In some embodiments, the confidence level can be based on a length of the at least one response. In some embodiments, the confidence level can be based on an evaluated truthfulness of the one or more responses of the subject.

In some embodiments, the one or more individual, joint, or fused models can be trained on speech data from a plurality of test subjects, wherein each of the plurality of test subjects has completed a survey or questionnaire that indicates whether the test subject has the mental health disorder. The confidence level can be based on an evaluated truthfulness of responses in the survey or questionnaire.

In some embodiments, the method can further comprise extracting from the speech data one or more topics of concern of the subject using a topic model.

In some embodiments, the method can further comprise generating a word cloud from the one or more topics of concern. The word cloud reflects changes in the one or more topics of concern of the subject over time. In some embodiments, the method can further comprise transmitting the one or more topics of concern to a healthcare provider, the subject, or both.
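A toy sketch of tracking topics of concern across sessions for such a word cloud is shown below; the topic labels are invented, and a real embodiment would obtain them from the topic model discussed above.

    # Count topic mentions per session; the counts could drive word sizes in a
    # word cloud and, compared across sessions, reflect change over time.
    from collections import Counter

    session_topics = {
        "session_1": ["sleep", "work", "sleep", "family"],
        "session_2": ["work", "work", "appetite"],
    }
    clouds = {s: Counter(topics) for s, topics in session_topics.items()}
    # e.g., clouds["session_2"]["work"] == 2: "work" grew as a topic of concern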

In some embodiments, the video output can be assigned a higher weight than the NLP output and the acoustic output in generating the score when the subject is not speaking. In some embodiments, a weight of the video output in generating the score can be increased when the NLP output and the acoustic output indicate that a truthfulness level of the subject is below a threshold.

In some embodiments, the video model can comprise one or more of a facial cue model, a body movement/motion model, and a gaze model.

In some embodiments, the at least one query can comprise a plurality of queries and the at least one response can comprise a plurality of responses. The plurality of queries can be configured to sequentially and systematically elicit the plurality of responses from the subject. The plurality of queries can be structured in a hierarchical manner such that each subsequent query of the plurality of queries can be a logical follow-on to the subject's response to a preceding query and can be designed to assess or draw inferences about different aspects of the mental state of the subject.

In some embodiments, the at least one query can include subject matter that has been adapted or modified from a clinically-validated survey, test, or questionnaire.

In some embodiments, the acoustic model can comprise one or more of an acoustic embedding model, a spectral-temporal model, a supervector model, an acoustic affect model, a speaker personality model, an intonation model, a speaking rate model, a pronunciation model, a non-verbal model, or a fluency model.

In some embodiments, the NLP model can comprise one or more of a sentiment model, a statistical language model, a topic model, a syntactic model, an embedding model, a dialog or discourse model, an emotion or affect model, or a speaker personality model.

In some embodiments, the mental health disorder can comprise depression, anxiety, post-traumatic stress disorder, bipolar disorder, suicidality, or schizophrenia.

In some embodiments, the mental health disorder can comprise one or more medical, psychological, or psychiatric conditions or symptoms.

In some embodiments, the score can comprise a score selected from a range. The range can be normalized with respect to a general population or to a specific population of interest.
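Normalization against a reference population might look like the following sketch, which expresses a raw score as a standard z-score relative to placeholder population statistics; the disclosure does not specify this formula.

    # Illustrative population normalization: express a raw score relative to a
    # reference population's mean and standard deviation (placeholder values).
    def normalize(raw: float, pop_mean: float, pop_std: float) -> float:
        return (raw - pop_mean) / pop_std   # standard score within the population

    general = normalize(0.62, pop_mean=0.40, pop_std=0.15)   # general population
    specific = normalize(0.62, pop_mean=0.55, pop_std=0.10)  # population of interest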

In some embodiments, the one or more scores with a clinical value can comprise one or more descriptors associated with the mental health disorder.

In some embodiments, steps (a)-(d) as described above can be repeated at a plurality of different times to generate a plurality of scores. The method can further comprise: transmitting the plurality of scores and confidences to a computing device and graphically displaying, on the computing device, the plurality of scores and confidences as a function of time on a dashboard or other representation for one or more end users.

In some embodiments, the quality of the speech data can comprise a quality of an audio signal of the speech data.

In some embodiments, the quality of the speech data can comprise a measure of confidence of a speech recognition process performed on an audio signal of the speech data.
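Combining those two notions of speech-data quality into a single confidence level might be sketched as below; the geometric-mean combination is an assumption for illustration, not the disclosed method.

    import math

    def confidence_level(audio_quality: float, asr_confidence: float) -> float:
        """Fold audio-signal quality and speech-recognition confidence (both in
        [0, 1]) into one confidence level; the geometric mean is an arbitrary choice."""
        return math.sqrt(audio_quality * asr_confidence)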

In some embodiments, the method can be implemented for a single session. The score and the confidence level of the score can be generated for the single session.

In some embodiments, the method can be implemented for and over multiple different sessions, and the score and the confidence level of the score can be generated for each of the multiple different sessions, or upon completion of one or more sessions of the multiple different sessions.

Another aspect of the present disclosure provides a non-transitory computer-readable medium comprising machine-executable instructions that, upon execution by one or more computer processors, implement any of the methods described above or elsewhere herein.

Another aspect of the present disclosure provides a system comprising one or more computer processors and memory comprising machine-executable instructions that, upon execution by the one or more computer processors, implement any of the methods described above or elsewhere herein.

Another aspect of the present disclosure provides a method for processing speech and/or video data of a subject to identify a mental state of the subject. The method can comprise: receiving the speech and/or video data of the subject and using at least one processing technique to process the speech and/or video data to identify the mental state at (i) an error rate that is at least 10% lower or (ii) an accuracy that is at least 10% higher than that of a standardized mental health questionnaire or testing tool usable for identifying the mental state. The reduced error rate or the increased accuracy can be established relative to at least one or more benchmark standards usable by an entity for identifying or assessing one or more medical conditions comprising the mental state.

In some embodiments, the entity can comprise one or more of the following: clinicians, healthcare providers, insurance companies, and government-regulated bodies. In some embodiments, the at least one or more benchmark standards can comprise at least one clinical diagnosis that has been independently verified to be accurate in identifying the mental state. In some embodiments, the speech data can be received substantially in real time as the subject is speaking. In some embodiments, the speech data can be produced in an offline mode from a stored recording of the subject's speech.

Another aspect of the present disclosure provides a method for processing speech data of a subject to identify a mental state of the subject. The method can comprise: receiving the speech data of the subject and using at least one processing technique to process the speech data to identify the mental state. The identification of the mental state is better according to one or more performance metrics as compared to a standardized mental health questionnaire or testing tool usable for identifying the mental state.

In some embodiments, the one or more performance metrics can comprise a sensitivity or specificity, and the speech data can be processed according to a desired level of sensitivity or a desired level of specificity. In some embodiments, the desired level of sensitivity or the desired level of specificity can be defined based on criteria established by an entity. In some embodiments, the entity can comprise one or more of the following: clinicians, healthcare providers, personal caregivers, insurance companies, and government-regulated bodies.
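Selecting an operating threshold for a desired sensitivity could be sketched as follows; the validation scores and labels are fabricated for illustration, and the analogous selection for specificity would rank the negative class instead.

    # Pick the decision threshold that classifies the desired fraction of known
    # positives as positive on held-out data (1 = has the condition).
    def threshold_for_sensitivity(scores, labels, desired_sensitivity):
        positives = sorted((s for s, y in zip(scores, labels) if y == 1),
                           reverse=True)
        k = max(1, round(desired_sensitivity * len(positives)))
        return positives[k - 1]   # lowest score still counted as positive

    t = threshold_for_sensitivity(
        scores=[0.9, 0.8, 0.7, 0.4, 0.3, 0.2],
        labels=[1, 1, 0, 1, 0, 0],
        desired_sensitivity=0.9)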

Another aspect of the present disclosure provides a method for processing speech data of a subject to identify or assess a mental state of the subject. The method can comprise: receiving the speech data of the subject, using one or more processing techniques to process the speech data to generate one or more descriptors indicative of the mental state, and generating a plurality of visual elements of the one or more descriptors. The plurality of visual elements can be configured to be displayed on a graphical user interface of an electronic device of a user and usable by the user to identify or assess the mental state.

In some embodiments, the user can be the subject. In some embodiments, the user can be a clinician or healthcare provider. In some embodiments, the one or more descriptors can comprise a calibrated or normalized score indicative of the mental state. In some embodiments, the one or more descriptors can further comprise a confidence associated with the calibrated or normalized score.

Another aspect of the present disclosure provides a method for identifying, assessing, or monitoring a mental state of a subject. The method can comprise using a natural language processing algorithm, an acoustic processing algorithm, or a video processing algorithm to process data of the subject to identify or assess the mental state of the subject, the data comprising speech or video data of the subject, and outputting a report indicative of the mental state of the subject. The report can be transmitted to a user to be used for identifying, assessing, or monitoring the mental state.

In some embodiments, the user can be the subject. In some embodiments, the user can be a clinician or healthcare provider. In some embodiments, the report can comprise a plurality of graphical visual elements. In some embodiments, the report can be configured to be displayed on a graphical user interface of an electronic device of the user. In some embodiments, the method can further comprise: updating the report in response to one or more detected changes in the mental state of the subject. In some embodiments, the report can be updated substantially in real time as the one or more detected changes in the mental state are occurring in the subject.

Another aspect of the present disclosure provides a method for identifying whether a subject is at risk of a mental or physiological condition. The method can comprise: obtaining speech data from the subject and storing the speech data in computer memory, processing the speech data using in part natural language processing to identify one or more features indicative of the mental or physiological condition, and outputting an electronic report identifying whether the subject is at risk of the mental or physiological condition, and the risk can be quantified in the form of a normalized score with a confidence level. The normalized score with the confidence level can be usable by a user to identify whether the subject is at risk of the mental or physiological condition.

In some embodiments, the user can be the subject. In some embodiments, the user can be a clinician or healthcare provider. In some embodiments, the report can comprise a plurality of graphical visual elements. In some embodiments, the report can be configured to be displayed on a graphical user interface of an electronic device of the user.

Another aspect of the present disclosure provides a method for identifying, assessing, or monitoring a mental state or disorder of a subject. The method can comprise: receiving audio or audio-visual data comprising speech of the subject in computer memory and processing the audio or audio-visual data to identify, assess, monitor, or diagnose the mental state or disorder of the subject, which processing can comprise performing natural language processing on the speech of the subject.

In some embodiments, the audio or audio-visual data can be received in response to a query directed to the subject. In some embodiments, the audio or audio-visual data can be from a prerecording of a conversation to which the subject can be a party. In some embodiments, the audio or audio-visual data can be from a prerecording of a clinical session involving the subject and a healthcare provider. In some embodiments, the mental state or disorder can be identified at a higher performance level compared to a standardized mental health questionnaire or testing tool. In some embodiments, the processing can further comprise using a trained algorithm to perform acoustic analysis on the speech of the subject.

Another aspect of the present disclosure provides a method for estimating whether a subject has a mental condition and providing the estimate to a stakeholder. The method can comprise: obtaining speech data from the subject and storing the speech data in computer memory. The speech data can comprise responses to a plurality of queries transmitted in an audio and/or visual format to the subject. The method can further comprise selecting (1) a first model optimized for sensitivity in estimating whether the subject has the mental condition or (2) a second model optimized for specificity in estimating whether the subject has the mental condition. The method can further comprise processing the speech data using the selected first model or the second model to generate the estimate. The method can further comprise transmitting the estimate to the stakeholder.

In some embodiments, the first model can be selected and the stakeholder can be a healthcare payer. In some embodiments, the second model can be selected and the stakeholder can be a healthcare provider.

Another aspect of the present disclosure provides a system for determining whether a subject can be at risk of having a mental condition. The system can be configured to (i) receive the speech data from the memory and (ii) process the speech data using at least one model to determine that the subject is at risk of having the mental condition. The at least one model can be trained on speech data from a plurality of other test subjects who have a clinical determination of the mental condition. The clinical determinations may serve as labels for the speech data. The system can be configured to generate the estimate of the mental condition that is better according to one or more performance metrics as compared to a clinically-validated survey, test, or questionnaire.

In some embodiments, the system can be configured to generate the estimate of the mental condition with a higher specificity compared to the clinically-validated survey, test, or questionnaire. In some embodiments, the system can be configured to generate the estimate of the mental condition with a higher sensitivity compared to the clinically-validated survey, test, or questionnaire. In some embodiments, the identification can be output while the subject is speaking. In some embodiments, the identification can be output via streaming or a periodically updated signal.

Another aspect of the present disclosure provides a method for assessing a mental state of a subject. The method can comprise using an automated screening module to dynamically formulate at least one query based in part on one or more target mental states to be assessed. The at least one query can be configured to elicit at least one response from the subject. The method can further comprise transmitting the at least one query in an audio and/or visual format to the subject to elicit the at least one response. The method can further comprise receiving data comprising the at least one response from the subject in response to transmitting the at least one query. The data can comprise speech data. The method can further comprise processing the data using a composite model comprising at least one or more semantic models to generate an assessment of the mental state of the subject.

Another aspect of the present disclosure provides a non-transitory computer readable medium comprising machine executable code that, upon execution by one or more computer processors, implements any of the methods above or elsewhere herein.

Another aspect of the present disclosure provides a system comprising one or more computer processors and computer memory coupled thereto. The computer memory comprises machine executable code that, upon execution by the one or more computer processors, implements any of the methods above or elsewhere herein.

Additional aspects and advantages of the present disclosure will become readily apparent to those skilled in this art from the following detailed description, wherein only illustrative embodiments of the present disclosure are shown and described. As will be realized, the present disclosure is capable of other and different embodiments, and its several details are capable of modifications in various obvious respects, all without departing from the disclosure. Accordingly, the drawings and description are to be regarded as illustrative in nature, and not as restrictive.

INCORPORATION BY REFERENCE

All publications, patents, and patent applications mentioned in this specification are herein incorporated by reference to the same extent as if each individual publication, patent, or patent application was specifically and individually indicated to be incorporated by reference. To the extent publications and patents or patent applications incorporated by reference contradict the disclosure contained in the specification, the specification is intended to supersede and/or take precedence over any such contradictory material.

BRIEF DESCRIPTION OF THE DRAWINGS

The novel features of the present disclosure are set forth with particularity in the appended claims. A better understanding of the features and advantages of the present disclosure will be obtained by reference to the following detailed description that sets forth illustrative embodiments, in which the principles of the disclosure are utilized, and the accompanying drawings (also “Figure” and “FIG.” herein), of which:

FIG. 1A shows a health screening or monitoring system in which a health screening or monitoring server, a clinical data server computer system, and a social data server cooperate to estimate a health state of a patient in accordance with the present disclosure;

FIG. 1B shows an additional embodiment of the health screening or monitoring system from FIG. 1A;

FIG. 2 shows a patient screening or monitoring system in which a web server and modeling server(s) cooperate to assess a state of a patient through a wide area network, in accordance with some embodiments;

FIG. 3 shows a patient assessment system in which a real-time computer system, a modeling computer system, and a clinical and demographic data server computer system cooperate to assess a state of a patient and report the assessed state to a clinician using a clinician device through a wide area network in accordance with the present disclosure;

FIG. 4 is a block diagram of the health screening or monitoring server of FIG. 1A in greater detail;

FIG. 5 is a block diagram of interactive health screening or monitoring logic of the health screening or monitoring server of FIG. 4 in greater detail;

FIG. 6 is a block diagram of interactive screening or monitoring server logic of the interactive health screening or monitoring logic of FIG. 5 in greater detail;

FIG. 7 is a block diagram of generalized dialogue flow logic of the interactive screening or monitoring server logic of FIG. 6 in greater detail;

FIG. 8 is a logic flow diagram illustrating the control of an interactive spoken conversation with the patient by the generalized dialogue flow logic in accordance with the present disclosure;

FIG. 9 is a block diagram of a question and adaptive action bank of the generalized dialogue flow logic of FIG. 7 in greater detail;

FIG. 10 is a logic flow diagram of a step of FIG. 8 in greater detail;

FIG. 11 is a block diagram of question management logic of the question and adaptive action bank of FIG. 9 in greater detail;

FIG. 12 is a logic flow diagram of determination of the quality of a question in accordance with the present disclosure;

FIG. 13 is a logic flow diagram of determination of the equivalence of two questions in accordance with the present disclosure;

FIG. 14 is a logic flow diagram illustrating the control of an interactive spoken conversation with the patient by the real-time system in accordance with the present disclosure;

FIGS. 15 and 16 are each a logic flow diagram of a respective step of FIG. 14 in greater detail;

FIG. 17 is a transaction flow diagram showing an illustrative example of a spoken conversation with, and controlled by, the real-time system of FIG. 3;

FIG. 18 is a block diagram of runtime model server logic of the interactive health screening or monitoring logic of FIG. 3 in greater detail;

FIG. 19 is a block diagram of model training logic of the interactive health screening or monitoring logic of FIG. 1A in greater detail;

FIG. 20A shows a more detailed block diagram of the patient screening or monitoring system, in accordance with some embodiments;

FIG. 20B provides a block diagram of the runtime model server(s), in accordance with some embodiments;

FIG. 21 provides a block diagram of the model training server(s), in accordance with some embodiments;

FIG. 22 shows the real-time computer system and the modeling computer system of FIG. 3 in greater detail, including a general flow of data;

FIG. 23A provides a block diagram of the acoustic model, in accordance with some embodiments;

FIG. 23B shows an embodiment of FIG. 23A including an acoustic modeling block;

FIG. 23C shows a score calibration and confidence module;

FIG. 24 provides a simplified example of the high-level feature representor of the acoustic model, for illustrative purposes;

FIG. 25 provides a block diagram of the Natural Language Processing (NLP) model, in accordance with some embodiments;

FIG. 26 provides a block diagram of the visual model, in accordance with some embodiments;

FIG. 27 provides a block diagram of the descriptive features, in accordance with some embodiments;

FIG. 28 provides a block diagram of the interaction engine, in accordance with some embodiments;

FIG. 29 is a logic flow diagram of the example process of testing a patient for a mental health condition, in accordance with some embodiments;

FIG. 30 is a logic flow diagram of the example process of model training, in accordance with some embodiments;

FIG. 31 is a logic flow diagram of the example process of model personalization, in accordance with some embodiments;

FIG. 32 is a logic flow diagram of the example process of client interaction, in accordance with some embodiments;

FIG. 33 is a logic flow diagram of the example process of classifying the mental state of the client, in accordance with some embodiments;

FIG. 34 is a logic flow diagram of the example process of model conditioning, in accordance with some embodiments;

FIG. 35 is a logic flow diagram of the example process of model weighting and fusion, in accordance with some embodiments;

FIG. 36 is a logic flow diagram of the example simplified process of acoustic analysis, provided for illustrative purposes only;

FIG. 37 is a block diagram showing speech recognition logic of the modeling computer system in greater detail;

FIG. 38 is a block diagram showing language model training logic of the modeling computer system in greater detail;

FIG. 39 is a block diagram showing language model logic of the modeling computer system in greater detail;

FIG. 40 is a block diagram showing acoustic model training logic of the modeling computer system in greater detail;

FIG. 41 is a block diagram showing acoustic model logic of the modeling computer system in greater detail;

FIG. 42 is a block diagram showing visual model training logic of the modeling computer system in greater detail;

FIG. 43 is a block diagram showing visual model logic of the modeling computer system in greater detail;

FIG. 44 is a block diagram of a screening or monitoring system data store of the interactive health screening or monitoring logic of FIG. 1A in greater detail;

FIG. 45 shows a health screening or monitoring system in which a health screening or monitoring server estimates a health state of a patient by passively listening to ambient speech in accordance with the present disclosure;

FIG. 46 is a logic flow diagram illustrating the estimation of a health state of a patient by passively listening to ambient speech in accordance with the present disclosure;

FIG. 47 is a logic flow diagram illustrating the estimation of a health state of a patient by passively listening to ambient speech in accordance with the present disclosure;

FIG. 48 is a block diagram of health care management logic of the health screening or monitoring server of FIG. 4 in greater detail;

FIGS. 49 and 50 are respective block diagrams of component conditions and actions of work-flows of the health care management logic of FIG. 48;

FIG. 51 is a logic flow diagram of the automatic formulation of a work-flow of the health care management logic of FIG. 48 in accordance with the present disclosure;

FIG. 52 is a block diagram of the real-time computer system of FIG. 3 in greater detail;

FIG. 53 is a block diagram of the modeling computer system of FIG. 3 in greater detail;

FIG. 54 is a block diagram of the health screening or monitoring server of FIG. 1A in greater detail;

FIGS. 55 and 56 provide example illustrations of spectrograms of an acoustic signal used for analysis, in accordance with some embodiments;

FIGS. 57 and 58 are example illustrations of a computer system capable of embodying the current disclosure;

FIG. 59 shows a precision case management use case for the system;

FIG. 60 shows a primary care screening or monitoring use case for the system;

FIG. 61 shows a system for enhanced employee assistance plan (EAP) navigation and triage; and

FIG. 62 shows a computer system that is programmed or otherwise configured to assess a mental state of a subject in a single session or over multiple different sessions.

DETAILED DESCRIPTION

While various embodiments of the invention have been shown and described herein, it will be obvious to those skilled in the art that such embodiments are provided by way of example only. Numerous variations, changes, and substitutions may occur to those skilled in the art without departing from the invention. It should be understood that various alternatives to the embodiments of the invention described herein may be employed.

Aspects, features, and advantages of exemplary embodiments of the present invention will become better understood with regard to the following description in connection with the accompanying drawing(s). It should be apparent to those skilled in the art that the described embodiments of the present invention provided herein are illustrative only and not limiting, having been presented by way of example only. All features disclosed in this description may be replaced by alternative features serving the same or similar purpose, unless expressly stated otherwise. Therefore, numerous other embodiments and modifications thereof are contemplated as falling within the scope of the present invention as defined herein and equivalents thereto.

Henceforth, use of absolute and/or sequential terms, such as, for example, “will,” “will not,” “shall,” “shall not,” “must,” “must not,” “first,” “initially,” “next,” “subsequently,” “before,” “after,” “lastly,” and “finally,” is not meant to limit the scope of the present invention, as the embodiments disclosed herein are merely exemplary. The present invention relates to health screening or monitoring systems and, more particularly, to a computer-implemented mental health screening or monitoring tool with significantly improved accuracy and efficacy achieved by leveraging language analysis, visual cues, and acoustic analysis. In this application, the specifics of improved acoustic, visual, and speech analysis techniques are described as they pertain to the classification of a respondent as being depressed, or as exhibiting another mental state of interest. While much of the following disclosure will focus largely on assessing depression in a patient, the systems and methods described herein may be equally adept at screening or monitoring a user for a myriad of mental and physical ailments. For example, bipolar disorder, anxiety, and schizophrenia are mental ailments that such a system may be adept at screening or monitoring for. It is also possible that physical ailments may be assessed utilizing such systems. It should be understood that while this disclosure may focus heavily upon depression screening or monitoring, this is not limiting. Any suitable mental or physical ailment may be screened for using the disclosed systems and methods.

The systems and methods disclosed herein may use natural language processing (NLP) to perform semantic analysis on patient speech utterances. Semantic analysis, as disclosed herein, may refer to analysis of spoken language from patient responses to assessment questions or captured conversations, in order to determine the meaning of the spoken language for the purpose of conducting a mental health screening or monitoring of the patient. The analysis may be of words or phrases, and may be configured to account for primary queries or follow-up queries. In the case of captured human-human conversations, the analysis may also apply to the speech of the other party. As used herein, the terms “semantic analysis” and “natural language processing (NLP)” may be used interchangeably. Semantic analysis may be used to determine the meanings of utterances by patients, in context. It may also be used to determine topics patients are speaking about.
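By way of a toy illustration only, a semantic pass over an utterance might reduce to spotting topical keywords and a crude sentiment polarity, as below; a real embodiment would use the trained NLP models described herein rather than these invented word lists.

    # Toy semantic pass over an utterance: crude topic spotting and polarity.
    NEGATIVE = {"tired", "hopeless", "alone", "worthless"}
    TOPICS = {"sleep": {"sleep", "insomnia"}, "mood": {"sad", "hopeless", "down"}}

    def analyze(utterance: str):
        words = set(utterance.lower().split())
        topics = [t for t, vocab in TOPICS.items() if words & vocab]
        polarity = -len(words & NEGATIVE)   # more negative words, lower polarity
        return topics, polarity

    # analyze("I feel hopeless and cannot sleep") -> (["sleep", "mood"], -1)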

A mental state, as described herein, may be distinguished from an emotion or feeling, such as happiness, sadness, or anger. A mental state may include one or more feelings in combination with a philosophy of mind, including how a person perceives objects of his or her environment and the actions of other people toward him or her. While feelings may be transient, a mental state may describe a person's overarching disposition or mood, even in situations where the person's feelings may change. For example, a depressed person may variously feel, at different times, happy, sad, or angry.

In accordance with one or more embodiments of the present invention, a server computer system (health screening or monitoring server 102—FIG. 1A) may apply a health state screening or monitoring test to a human patient using a client device (patient device 112), by engaging the patient in an interactive spoken conversation and applying a composite model, which may combine language, acoustic, metadata, and visual models, to a captured audiovisual signal of the patient engaged in the dialogue. While the general subject matter of the screening or monitoring test may be similar to the subject matter of standardized depression screening or monitoring tests such as the PHQ-9, the composite model may analyze, in real time, the audiovisual signal of the patient (i) to make the conversation more engaging for the patient and (ii) to estimate the patient's health. Appendix A illustrates an exemplary implementation that includes Calendaring, SMS, Dialog, Calling and User Management Services. While the latter goal is primary, the former goal is a significant factor in achieving the latter. Truthfulness of the patient in answering questions posed by the screening or monitoring test is critical in assessing the patient's mood. Health screening or monitoring server 102 encourages patient honesty.

First, the spoken conversation may provide the patient with less time to compose a disingenuous response to a question than it would take to simply respond honestly to the question. Second, the conversation may feel, to the patient, more spontaneous and personal and may be less annoying to the patient than a generic questionnaire, as would be provided by, for example, simply administering the PHQ-9. Accordingly, the spoken conversation may not induce or exacerbate resentment in the patient for having to answer a questionnaire for the benefit of a doctor or other clinician. Third, the spoken conversation may be adapted in progress to be responsive to the patient, reducing the patient's annoyance with the screening or monitoring test and, in some situations, shortening the screening or monitoring test. Fourth, the screening or monitoring test as administered by health screening or monitoring server 102 may rely on non-verbal aspects of the conversation in addition to the verbal content of the conversation to assess depression in the patient.

As shown in FIG. 1A, health screening or monitoring system 100 may include health screening or monitoring server 102, a call center system 104, a clinical data server 106, a social data server 108, a patient device 112, and a clinician device 114 that are connected to one another through a wide area network (WAN) 110, which is the Internet in this illustrative embodiment. In this illustrative embodiment, patient device 112 may also be reachable by call center system 104 through a public-switched telephone network (PSTN) 120 or directly. Health screening or monitoring server 102 may be a server computer system that administers the health screening or monitoring test with the patient through patient device 112 and combines a number of language, acoustic, and visual models to produce results 1820 (FIG. 18), using clinical data retrieved from clinical data server 106, social data retrieved from social data server 108, and patient data collected from past screening or monitoring sessions to train the models of runtime model server 304 (FIG. 18). Clinical data server 106 (FIG. 1A) may be a server computer system that makes clinical or demographic data of the patient, including diagnoses, medication information, etc., available, e.g., to health screening or monitoring server 102, in a manner that is compliant with HIPAA (Health Insurance Portability and Accountability Act of 1996) and/or any other privacy and security policies and regulations such as GDPR and SOC 2. Social data server 108 may be a server computer system that makes social data of the patient, including social media posts, online purchases, searches, etc., available, e.g., to health screening or monitoring server 102. Clinician device 114 may be a client device that receives data representing results of the screening or monitoring regarding the patient's health from health screening or monitoring server 102.

The system may be used to assess the mental state of the subject in a single session or over multiple sessions. Subsequent sessions may be informed by assessment results from prior assessments. This may be done by providing assessment data as inputs to machine learning algorithms or other analysis methods for the subsequent assessments. Each session may generate one or more assessments. Individual assessments may also compile data from multiple sessions.

FIG. 1B shows an additional embodiment of the health screening or monitoring system from FIG. 1A. FIG. 1B illustrates a conversation between patient 120 and clinician 130. The clinician 130 may record one or more speech samples from the patient 120 and upload them to the wide area network 110, with the consent of the patient 120. The speech samples may be analyzed by one or more machine learning algorithms, described elsewhere herein.

FIG. 2 provides an additional embodiment of a health screening or monitoring system. Health screening or monitoring system 200 may apply a health state screening or monitoring test to a human patient using a client device (clients 260 a-n), by engaging the patient in an interaction and applying a composite model that combines language, acoustic, and visual models, to a captured audiovisual signal of the patient engaged in the dialogue.

While the general subject matter of the screening or monitoring test may be similar to the subject matter of standardized depression screening or monitoring tests such as the PHQ-9, the composite model can be configured to analyze, in real time, the audiovisual signal of the patient (i) to make the conversation more engaging for the patient, (ii) to estimate the patient's mental health, and (iii) to provide a judgment-free and less embarrassing experience for the patient, who may already be suffering from anxiety and other mental barriers to receiving proper screening or monitoring from a clinician.

It should be noted that throughout this disclosure a series of terms may be used interchangeably, and this usage is not intended to limit the scope of the disclosure in any manner. For example, the terms “patient”, “client”, “subject”, “respondent” and “user” may all be employed interchangeably to refer to the individual being screened for mental health conditions and/or the device being utilized by this individual to collect and transmit the audio and visual data that is used to screen them. Likewise, “semantic analysis” and “NLP” may be used interchangeably to reference natural language processing models and elements. In a similar manner, “stakeholders” is employed to refer to a wide variety of interested third parties who are not the patient being screened. These stakeholders may include physicians, health care providers, care team members, insurance companies, research organizations, family/relatives of the patient, hospitals, crisis centers, and the like. It should thus be understood that when another label is employed, such as “physician”, the intention in this disclosure is to reference any number of stakeholders.

The health screening or monitoring system 200 includes a backend infrastructure designed to administer the screening or monitoring interaction and analyze the results. This includes one or more model servers 230 coupled to a web server 240. The web server 240 and model server(s) 230 leverage user data 220, which is additionally populated by clinical and social data 210. The clinical data portion may be compiled from the healthcare providers, and may include diagnoses, vital information (age, weight, height, blood chemistry, etc.), diseases, medications, lists of clinical encounters (hospitalizations, clinic visits, Emergency Department visits), clinician records, and the like. This clinical data may be compiled from one or more electronic health record (EHR) systems or Health Information Exchanges (HIE) by way of a secure application protocol, extension, or socket. Social data may include information collected from a patient's social networks, including social media postings, from databases detailing the patient's purchases, and from databases containing the patient's economic, educational, residential, legal, and other social determinants. This information may be compiled together with additional preference data, metadata, annotations, and voluntarily supplied information, to populate the user database 220. The model server 230 and web server 240 are additionally capable of populating and/or augmenting the user data 220 with preferences, extracted features, and the like.

The backend infrastructure communicates with the clients 260 a-n via a network infrastructure 250. Commonly this network may include the internet, a corporate local area network, a private intranet, a cellular network, or some combination of these. The clients 260 a-n include a client device of a person being screened, which accesses the backend screening or monitoring system and includes a microphone and camera for audio and video capture, respectively. The client device may be a cellular phone, tablet, laptop, or desktop equipped with a microphone and optional camera, a smart speaker in the home or other location, a smart watch with a microphone and optional camera, or a similar device.

A client device may collect additional data, such as biometric data. For example, smart watches and fitness trackers already have the capability of measuring motion, heart rate, and sometimes respiratory rate, blood oxygenation levels, and other physiologic parameters. Future smart devices may record conductivity measurements for tracking perspiration, pH changes in the skin, and other chemical or hormonal changes. Client devices may operate in concert to collect data. For example, a phone may capture the audio and visual data while a Bluetooth-paired fitness tracker simultaneously provides body temperature, pulse rate, respiratory rate, and movement data.

All of the collected data for each client 260 a-n is provided back to the web server 240 via the network infrastructure 250. After processing, results are provided back to the client 260 a-n for consumption and, when desired, for sharing with one or more stakeholders 270 a-n associated with the given client 260 a-n, respectively. In this example figure, the stakeholders 270 a-n are illustrated as being in direct communication with their respective clients 260 a-n. While in practice this may indeed be possible, often the stakeholder 270 a-n will be capable of direct access to the backend screening or monitoring system via the network infrastructure 250 and web server 240, without the need to use the client 260 a-n as an intermediary. FIG. 2 presents this arrangement, however, to more clearly illustrate that each client 260 a-n may be associated with one or more stakeholders 270 a-n, which may differ from any other client's 260 a-n stakeholders 270 a-n.

In another embodiment of the screening or monitoring system, a server computer system (real-time system 302—FIGS. 3 and 22) applies a depression assessment test to a human patient using a client device (portable device 312), by engaging the patient in an interactive spoken conversation and applying a composite model, which combines language, acoustic, and visual models, to a captured audiovisual signal of the patient engaged in the dialogue. While the general subject matter of the assessment test may incorporate queries including subject matter similar to questions asked in standardized depression assessment tests such as the PHQ-9, the assessment does not merely include analysis of answers to survey questions. In fact, the screening or monitoring system's composite model analyzes, in real time, the audiovisual signal of the patient (i) to make the conversation more engaging for the patient and (ii) to assess the patient's mental health.

While the latter goal is the primary goal, the former goal is a significant factor in achieving the latter. Truthfulness of the patient in answering questions posed by the assessment test is critical in assessing the patient's mood. Real-time system 302 encourages honesty of the patient in a number of ways. First, the spoken conversation gives the patient less time to compose a response to a question than it would take to simply respond honestly to the question.

Second, the conversation feels, to the patient, more spontaneous and personal, and is less annoying to the patient, than an obviously generic questionnaire. Accordingly, the spoken conversation does not induce or exacerbate resentment in the patient for having to answer a questionnaire before seeing a doctor or other clinician. Third, the spoken conversation is adapted in progress to be responsive to the patient, reducing the patient's annoyance with the assessment test and, in some situations, shortening the assessment test. Fourth, the assessment test as administered by real-time system 302 may rely more on non-verbal aspects of the conversation and the patient than on the verbal content of the conversation to assess depression in the patient.

As shown in FIG. 3, patient assessment system 300 includes real-time system 302, a modeling system 304, a clinical data server 306, a patient device 312, and a clinician device 314 that are connected to one another through a wide area network (WAN) 310, which is the Internet in this illustrative embodiment. Real-time system 302 is a server computer system that administers the depression assessment test with the patient through patient device 312. Modeling system 304 is a server computer system that combines a number of language, acoustic, and visual models to produce a composite model 2204 (FIG. 22), using clinical data retrieved from clinical data server 306 and patient data collected from past assessments to train composite model 2204. Clinical data server 306 (FIG. 3) is a server computer system that makes clinical data of the patient, including diagnoses, medication information, etc., available, e.g., to modeling system 304, in a manner that is compliant with HIPAA (Health Insurance Portability and Accountability Act of 1996) and/or any other privacy and security policies and regulations such as GDPR and SOC 2. Clinician device 314 is a client device that receives data representing a resulting assessment regarding depression from real-time system 302.

End-to-End Nature of the Systems

The systems disclosed herein may provide medical care professionals with a prediction of a mental state of a patient. The mental state may be depression, anxiety, or another mental condition. The systems may provide the medical care professionals with additional information beyond the mental state prediction. The system may provide demographic information, such as age, weight, occupation, height, ethnicity, medical history, psychological history, and gender to medical care professionals via client devices, such as the client devices 260 a-n of FIG. 2. The system may provide information from online systems or social networks with which the patient may be registered. The patient may opt in, by setting permissions on his or her client device, to provide this information before the screening or monitoring process begins. The patient may also be prompted to enter demographic information during the screening or monitoring process. Patients may also choose to provide information from their electronic health records to medical care professionals. In addition, medical care professionals may interview patients during or after a screening or monitoring event to obtain the demographic information. During registration for screening or monitoring, patients may also enter information that specifies or constrains their interests. For example, they may enter topics that they do and/or do not wish to speak about. In this disclosure, the terms “medical care provider” and “clinician” are used interchangeably. Medical care providers may be doctors, nurses, physician assistants, nursing assistants, clinical psychologists, social workers, technicians, or other health care providers.

A clinician may set up the mental health assessment with the patient. This may include choosing a list of questions for the system to ask the patient, including follow-up questions. The clinician may add or remove specific questions from the assessment, or change the order in which the questions are administered to the patient. The clinician may be available during the assessment as a proctor, in order to answer any clarifying questions the patient may have.

The system may provide the clinician with the dialogue between itself and the patient. This dialogue may be a recording of the screening or monitoring process, or a text transcript of the dialogue. The system may provide a summary of the dialogue between itself and the patient, using semantic analysis to choose the segments of speech that were most important to predicting the mental state of the patient. These segments may be selected because, for example, they are highly weighted in the calculation of a binary or scaled score indicating a mental state prediction. The system may incorporate such a produced score into a summary report for the patient, along with semantic context taken from a transcript of the interview with the patient.

The system may additionally provide the clinician with a “word cloud” or “topic cloud” extracted from a text transcript of the patient's speech. A word cloud may be a visual representation of individual words or phrases, with the words and phrases used most frequently designated using larger font sizes, different colors, different fonts, different typefaces, or any combination thereof. Depicting word or phrase frequency in such a way may be helpful because depressed patients commonly say particular words or phrases more frequently than non-depressed patients. For example, depressed patients may use words or phrases that indicate dark, black, or morbid humor. They may talk about feeling worthless or feeling like failures, or use absolutist language, such as “always”, “never”, or “completely.” Depressed patients may also use a higher frequency of first-person singular pronouns (e.g., “I”, “me”) and a lower frequency of second- or third-person pronouns when compared to the general population. The system may train a machine learning algorithm to perform semantic analysis of word clouds of depressed and non-depressed people, in order to classify people as depressed or not depressed based on their word clouds. Word cloud analysis may also be performed using unsupervised learning. For example, the system may analyze unlabeled word clouds and search for patterns, in order to separate people into groups based on their mental states.
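
By way of a non-limiting illustration, the following Python sketch computes frequency-based features of the kind described above from a text transcript. The word lists and function name are hypothetical examples for exposition only; they are not clinically validated markers or part of any disclosed model.

    from collections import Counter
    import re

    # Hypothetical, illustrative word lists (not clinically validated).
    ABSOLUTIST = {"always", "never", "completely", "nothing", "everything"}
    FIRST_PERSON = {"i", "me", "my", "mine", "myself"}
    OTHER_PERSON = {"you", "your", "he", "she", "they", "them", "his", "her", "their"}

    def word_cloud_features(transcript):
        """Compute simple frequency-based features from a patient transcript."""
        tokens = re.findall(r"[a-z']+", transcript.lower())
        counts = Counter(tokens)
        total = max(len(tokens), 1)
        first = sum(counts[w] for w in FIRST_PERSON)
        other = sum(counts[w] for w in OTHER_PERSON)
        return {
            "top_words": counts.most_common(10),   # basis for the word cloud
            "absolutist_rate": sum(counts[w] for w in ABSOLUTIST) / total,
            "first_person_rate": first / total,
            # Ratio of first-person to second-/third-person pronouns.
            "pronoun_ratio": first / max(other, 1),
        }

    print(word_cloud_features("I always feel like I never get anything right."))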

The systems described herein can output an electronic report identifying whether a patient is at risk of a mental or physiological condition. The electronic report can be configured to be displayed on a graphical user interface of a user's electronic device. The electronic report can include a quantification of the risk of the mental or physiological condition, e.g., a normalized score. The score can be normalized with respect to the entire population or with respect to a sub-population of interest. The electronic report can also include a confidence level of the normalized score. The confidence level can indicate the reliability of the normalized score (i.e., the degree to which the normalized score can be trusted).
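
As a non-limiting sketch of one possible normalization, the following Python example converts a raw risk score into a percentile-style normalized score against a reference population, assuming the reference scores are approximately normally distributed; the function name and sample values are hypothetical.

    import statistics
    from math import erf, sqrt

    def normalized_score(raw_score, population_scores):
        """Normalize a raw risk score against a reference (sub)population.

        The choice of reference population (general population vs. a
        sub-population of interest) is left to the caller.
        """
        mu = statistics.mean(population_scores)
        sigma = statistics.stdev(population_scores)
        z = (raw_score - mu) / sigma
        # Percentile under an assumed normal distribution of scores.
        percentile = 0.5 * (1.0 + erf(z / sqrt(2.0)))
        return {"z": z, "normalized": percentile}

    # Example: a patient's raw model score compared against a small cohort.
    print(normalized_score(0.72, [0.40, 0.55, 0.48, 0.61, 0.52, 0.45]))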

The electronic report can include visual graphical elements. For example, if the patient has multiple scores from multiple screening or monitoring sessions that occurred at several different times, the visual graphical element may be a graph that shows the progression of the patient's scores over time.

The electronic report may be output to the patient or a contact person associated with the patient, a healthcare provider, a healthcare payer, or another third party. The electronic report can be output substantially in real time, even while the screening, monitoring, or diagnosis is ongoing. In response to a change in the normalized score or confidence during the course of the screening, monitoring, or diagnosis, the electronic report can be updated substantially in real time and be re-transmitted to the user.

In some cases, the electronic report may include one or more descriptors of the patient's mental state. The descriptors can be a qualitative measure of the patient's mental state (e.g., “mild depression”). Alternatively or additionally, the descriptors can be topics that the patient mentioned during the screening. The descriptors can be displayed in a graphic, e.g., a word cloud.

The models described herein may be optimized for a particular purpose or based on the entity that may receive the output of the system. For example, the models may be optimized for specificity in estimating whether a patient has a mental condition. Healthcare payers such as insurance companies may prefer such models so that they can minimize the number of insurance payments made to patients with false positive diagnoses. In other cases, the models may be optimized for sensitivity in estimating whether a patient has a mental condition. Healthcare providers may prefer such models so that fewer true cases go undetected. The system may select the appropriate model based on the stakeholder to which the output will be transmitted. After processing, the system can transmit the output to the stakeholder.

The models described herein can alternatively be tuned or configured to process speech and other data according to a desired level of sensitivity or a desired level of specificity determined by a clinician, healthcare provider, insurance company, or government regulatory body.
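
One simple way to realize such tuning, sketched below in Python under assumed values, is to keep a table of operating points for a single underlying model and select the decision threshold that satisfies a stakeholder's sensitivity or specificity constraint; the table entries and function name are hypothetical.

    # Candidate operating points for one underlying model, expressed as
    # (decision_threshold, sensitivity, specificity). Values are illustrative.
    OPERATING_POINTS = [
        (0.30, 0.95, 0.70),
        (0.50, 0.88, 0.85),
        (0.70, 0.75, 0.95),
    ]

    def select_threshold(min_sensitivity=0.0, min_specificity=0.0):
        """Pick the first operating point satisfying a stakeholder's constraints."""
        for threshold, sens, spec in OPERATING_POINTS:
            if sens >= min_sensitivity and spec >= min_specificity:
                return threshold
        raise ValueError("no operating point satisfies the constraints")

    # A provider screening broadly might require high sensitivity; a payer
    # auditing claims might require high specificity.
    print(select_threshold(min_sensitivity=0.90))   # -> 0.30
    print(select_threshold(min_specificity=0.90))   # -> 0.70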

Use Cases

The system may be used to monitor teenagers for depression. The system may perform machine learning analysis on groups of teenagers in order to determine voice-based biomarkers that may uniquely classify teenagers as being at risk for depression. Depression in teenagers may have different causes than in adults. Hormonal changes may also introduce behaviors in teenagers that would be atypical for adults. A system for screening or monitoring teenagers would need to employ a model tuned to recognize these unique behaviors. For example, depressed or upset teenagers may be more prone to anger and irritability than adults, who may withdraw when upset. Thus, questions from assessments may elicit different voice-based biomarkers from teenagers than from adults. Different screening or monitoring methods may be employed when testing teenagers for depression, or studying teenagers' mental states, than are employed for screening or monitoring adults. Clinicians may modify assessments to particularly elicit voice-based biomarkers specific to depression in teenagers. The system may be trained using these assessments, and may determine a teenager-specific model for predicting mental states. Teenagers may further be segmented by household (foster care, adoptive parent(s), two biological parents, one biological parent, care by guardian/relative, etc.), medical history, gender, age (older vs. younger teenager), and socioeconomic status, and these segments may be incorporated into the model's predictions.

The system may also be used to monitor the elderly for depression and dementia. The elderly may also have particular voice-based biomarkers that younger adults may not have. For example, the elderly may have strained or thin voices, owing to aging. Elderly people may exhibit aphasia or dysarthria, have trouble understanding survey questions, follow-ups, or conversational speech, and may use repetitive language. Clinicians may develop, or algorithms may be used to develop, surveys for eliciting particular voice-based biomarkers from elderly patients. Machine learning algorithms may be developed to predict mental states in elderly patients specifically, by segmenting patients by age. Differences may be present in elderly patients from different generations (e.g., Greatest, Silent, Boomer), who may have different views on gender roles, morality, and cultural norms. Models may be trained to incorporate elder age brackets, gender, race, socioeconomic status, physical medical conditions, and family involvement.

The system may be used to test airline pilots for mental fitness. Airline pilots have taxing jobs, and may experience large amounts of stress and fatigue on long flights. Clinicians or algorithms may be used to develop screening or monitoring methods for these conditions. For example, the system may base an assessment on queries similar to those used in the Minnesota Multiphasic Personality Inventory (MMPI) and MMPI-2.

The system may also be used to screen military personnel for mental fitness. For example, the system may implement an assessment that uses queries with subject matter similar to those asked on the Primary Care Post-Traumatic Stress Disorder screen for the Diagnostic and Statistical Manual of Mental Disorders (DSM)-5 (PC-PTSD-5) to test for PTSD. In addition to PTSD, the system may screen military personnel for depression, panic disorder, phobic disorder, anxiety, and hostility. The system may employ different surveys to screen military personnel pre- and post-deployment. The system may segment military personnel by occupation, branch, officer or enlisted status, gender, age, ethnicity, number of tours/deployments, marital status, medical history, and other factors.

The system may be used to evaluate prospective gun buyers, e.g., by implementing background checks. Assessments may be designed, by clinicians or algorithmically, to evaluate prospective buyers for mental fitness for owning a firearm. The survey may be required to determine, using questions and follow-up questions, whether a prospective gun buyer could be certified as a danger to himself or herself or to others by a court or other authority.

Health screening or monitoring server 102 (FIG. 1A) is shown in greater detail in FIG. 4 and in even greater detail in FIG. 22. As shown in FIG. 4, health screening or monitoring server 102 includes interactive health screening or monitoring logic 402 and health care management logic 408. In addition, health screening or monitoring server 102 includes screening or monitoring system data store 410 and model repository 416.

Each of the components of health screening or monitoring server 102 is described more completely herein. Briefly, interactive health screening or monitoring logic 402 conducts an interactive conversation with the subject patient and estimates one or more health states of the patient by application of the models of runtime model server 504 (FIG. 18) to audiovisual signals representing responses by the patient. In this illustrative embodiment, interactive health screening or monitoring logic 402 (FIG. 4) may also operate in a passive listening mode, observing the patient outside the context of an interactive conversation with health screening or monitoring server 102, e.g., during a session with a health care clinician, and estimating a health state of the patient from such observation. Health care management logic 408 makes expert recommendations in response to health state estimations of interactive health screening or monitoring logic 402. Screening or monitoring system data store 410 stores and maintains all user and patient data needed for, and collected by, screening or monitoring in the manner described herein.

The conversational context of the health screening or monitoring system may improve one or more performance metrics associated with one or more machine learning algorithms used by the system. These metrics may include an F1 score, an area under the curve (AUC), a sensitivity, a specificity, a positive predictive value (PPV), and an equal error rate.
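
Several of these metrics can be computed directly from a confusion matrix, as in the following non-limiting Python sketch; AUC and the equal error rate additionally require ranked model scores and are omitted here for brevity. The function name and counts are hypothetical.

    def classification_metrics(tp, fp, tn, fn):
        """Compute the confusion-matrix metrics named above."""
        sensitivity = tp / (tp + fn)     # true positive rate (recall)
        specificity = tn / (tn + fp)     # true negative rate
        ppv = tp / (tp + fp)             # positive predictive value (precision)
        f1 = 2 * ppv * sensitivity / (ppv + sensitivity)
        return {"sensitivity": sensitivity, "specificity": specificity,
                "ppv": ppv, "f1": f1}

    print(classification_metrics(tp=80, fp=10, tn=90, fn=20))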

It should be appreciated that the behavior of health screening or monitoring server 102 described herein may be distributed across multiple computer systems. For example, in some illustrative embodiments, real-time, interactive behavior of health screening or monitoring server 102 (e.g., interactive screening or monitoring server logic 502 and runtime model server logic 504 described below) is implemented in one or more servers configured to handle large amounts of traffic through WAN 110 (FIG. 1A), and computationally intensive behavior of health screening or monitoring server 102 (e.g., health care management logic 408 and model training logic 506) is implemented in one or more other servers configured to efficiently perform highly complex computation. The various loads carried by health screening or monitoring server 102 may thus be distributed among multiple computer systems.

Interactive health screening or monitoring logic 402 is shown in greater detail in FIG. 5. Interactive health screening or monitoring logic 402 includes interactive screening or monitoring server logic 502, runtime model server logic 504, and model training logic 506. Interactive screening or monitoring server logic 502 conducts an interactive screening or monitoring conversation with the human patient; runtime model server logic 504 uses and adjusts a number of machine learning models to concurrently evaluate responsive audiovisual signals of the patient; and model training logic 506 trains the models of runtime model server logic 504.

Interactive screening or monitoring server logic 502 is shown in greater detail in FIG. 6 and includes generalized dialogue flow logic 602 and input/output (I/O) logic 604. I/O logic 604 effects the interactive screening or monitoring conversation by sending audiovisual signals to, and receiving audiovisual signals from, patient device 112. I/O logic 604 receives data from generalized dialogue flow logic 602 that specifies questions to be asked of the patient and sends audiovisual data representing those questions to patient device 112. In embodiments in which the interactive screening or monitoring conversation is effected through PSTN 120 (FIG. 1), I/O logic 604 (i) sends an audiovisual signal to patient device 112 by sending data to a human, or automated, operator of call center 104 prompting the operator to ask a question in a telephone call with patient device 112 (or alternatively by sending data to a backend automated dialog system destined for patients) and (ii) receives an audiovisual signal from patient device 112 by receiving an audiovisual signal of the interactive screening or monitoring conversation forwarded by call center 104. I/O logic 604 also sends at least portions of the received audiovisual signal of the interactive screening or monitoring conversation to runtime model server logic 504 (FIG. 18) and model training logic 506 (FIG. 19).

The queries asked of patients, or questions, may be stored as nodes, while patient responses, collected as audiovisual signals, may be stored as edges. A screening or monitoring event, or set of screening or monitoring events, for a particular patient may therefore be represented as a graph. For example, different answers to different follow-up questions may be represented as multiple spokes connecting a particular node to a plurality of other nodes. Different graph structures for different patients may be used as training examples for a machine learning algorithm as another method of determining a mental state classification for a patient. Classification may be performed by determining similarities between graphs of, for example, depressed patients. Equivalent questions, as discussed herein, may be labeled as such within the graph. Thus, the graphs may also be studied and analyzed to determine idiosyncrasies in patients' interpretations of different versions of questions.
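
A minimal sketch of such a question-and-response graph, in Python, is shown below; the node class, identifiers, and question text are hypothetical and serve only to illustrate question nodes connected by response-labeled edges.

    from dataclasses import dataclass, field

    @dataclass
    class QuestionNode:
        question_id: str
        text: str
        # Edges keyed by a summary of the patient's response; each edge
        # points to the follow-up question that the response triggered.
        edges: dict = field(default_factory=dict)

    def add_response(graph, from_q, response_label, to_q):
        """Record a response (edge) connecting one question to its follow-up."""
        graph[from_q].edges[response_label] = to_q

    # Build a tiny session graph: one question fanning out to two follow-ups.
    graph = {
        "q1": QuestionNode("q1", "How have you been sleeping?"),
        "q2": QuestionNode("q2", "How long has sleep been difficult?"),
        "q3": QuestionNode("q3", "What helps you sleep well?"),
    }
    add_response(graph, "q1", "poorly", "q2")
    add_response(graph, "q1", "well", "q3")
    print(graph["q1"].edges)  # {'poorly': 'q2', 'well': 'q3'}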

I/O logic 604 also receives results 1820 (FIG. 18) from runtime model server logic 504 that represent evaluation of the audiovisual signal. Generalized dialogue flow logic 602 conducts the interactive screening or monitoring conversation with the human patient. Generalized dialogue flow logic 602 determines what questions I/O logic 604 should ask of the patient and monitors the reaction of the patient as represented in results 1820. In addition, generalized dialogue flow logic 602 determines when to politely conclude the interactive screening or monitoring conversation.

Generalized dialogue flow logic 602 is shown in greater detail in FIG. 7. Generalized dialogue flow logic 602 includes interaction control logic generator 702. Interaction control logic generator 702 manages the interactive screening or monitoring conversation with the patient by sending data representing dialogue actions to I/O logic 604 (FIG. 6) that direct the behavior of I/O logic 604 in carrying out the interactive screening or monitoring conversation. Examples of dialogue actions include asking a question of the patient, repeating the question, instructing the patient, politely concluding the conversation, changing aspects of a display of patient device 112, and modifying characteristics of the speech presented to the patient by I/O logic 604, i.e., pace, volume, apparent gender of the voice, etc.

Interaction control logic generator 702 customizes the dialogue actions for the patient. Interaction control logic generator 702 receives data from screening or monitoring data store 210 that represents subjective preferences of the patient and a clinical and social history of the patient. In this illustrative embodiment, the subjective preferences are explicitly specified by the patient, generally prior to any interactive screening or monitoring conversation, and include such things as the particular voice to be presented to the patient through I/O logic 604, the default volume and pace of the speech generated by I/O logic 604, and the display schemes to be used within patient device 112.

The clinical and social history of the patient, in combination with identified interests of the patient, may indicate that questions related to certain topics should be asked of the patient. Interaction control logic generator 702 uses the patient's preferences and medical history to set attributes of the questions to ask the patient.

Interaction control logic generator 702 receives data from runtime model server logic 504 that represents analytical results of responses of the patient in the current screening or monitoring conversation. In particular, interaction control logic generator 702 receives data representing analytical results of responses, i.e., results 1820 (FIG. 18) of runtime model server logic 504, and patient and results metadata from descriptive model and analytics 1812 that facilitates proper interpretation of the analytical results. Interaction control logic generator 702 interprets the analytical results in the context of the results metadata to determine the patient's current status.

History and state machine 720 tracks the progress of the screening or monitoring conversation, i.e., which questions have been asked and which questions are yet to be asked. Question and dialogue action bank 710 is a data store that stores all dialogue actions that may be taken by interaction control logic generator 702, including all questions that may be asked of the patient. In addition, history and state machine 720 informs question and dialogue action bank 710 as to which question is to be asked next in the screening or monitoring conversation.

Interaction control logic generator 702 receives data representing the current state of the conversation and what questions are queued to be asked from history and state machine 720. Interaction control logic generator 702 processes the received data to determine the next action to be taken by interactive screening or monitoring server logic 502 in furtherance of the screening or monitoring conversation. Once the next action is determined, interaction control logic generator 702 retrieves data representing the action from question and dialogue action bank 710 and sends a request to I/O logic 604 to perform the next action.

The overall conducting of the screening or monitoring conversation by generalized dialogue flow logic 602 is illustrated in logic flow diagram 800 (FIG. 8). The logic flow diagram of FIG. 8 describes actions taken by components of the interaction engine in the block diagram of FIG. 28. In addition, the logic flow diagram of FIG. 8 is an instantiation of the process described in FIG. 14. In step 802, generalized dialogue flow logic 602 selects a question or other dialogue action to initiate the conversation with the patient. Interaction control logic generator 702 receives data from history and state machine 720 that indicates that the current screening or monitoring conversation is in its initial state. Interaction control logic generator 702 receives data that indicates (i) subjective preferences of the patient and (ii) topics of relatively high pertinence to the patient. Given that information, interaction control logic generator 702 selects an initial dialogue action with which to initiate the screening or monitoring conversation. Examples of the initial dialogue action may include (i) asking a common conversation-starting question such as “can you hear me?” or “are you ready to begin?”; (ii) asking a question from a predetermined script used for all patients; (iii) reminding the patient of a topic discussed in a previous screening or monitoring conversation with the patient and asking the patient a follow-up question on that topic; or (iv) presenting the patient with a number of topics from which to select using a user-interface technique on patient device 112. In step 802, interaction control logic generator 702 causes I/O logic 604 to carry out the initial dialogue action.

Loop step 804 and next step 816 define a loop in which generalized dialogue flow logic 602 conducts the screening or monitoring conversation according to steps 806-814 until generalized dialogue flow logic 602 determines that the screening or monitoring conversation is completed.

In step 806, interaction control logic generator 702 causes I/O logic 604 to carry out the selected dialogue action. In the initial performance of step 806, the dialogue action is the one selected in step 802. In subsequent performances of step 806, the dialogue action is the one selected in step 814 as described below. In step 808, generalized dialogue flow logic 602 receives an audiovisual signal of the patient's response to the question. While processing according to logic flow diagram 800 is shown in a manner that suggests synchronous processing, generalized dialogue flow logic 602 performs step 808 effectively continuously during performance of steps 802-816 and processes the conversation asynchronously. The same is true for steps 810-814. In step 810, I/O logic 604 sends the audiovisual signal received in step 808 to runtime model server logic 504, which processes the audiovisual signal in a manner described below. In step 812, I/O logic 604 of generalized dialogue flow logic 602 receives multiplex data from runtime model server logic 504 and produces therefrom an intermediate score for the screening or monitoring conversation so far.

As described above, the results data include analytical results data and results metadata. I/O logic 604 (i) determines to what degree the screening or monitoring conversation has completed screening or monitoring for the target health state(s) of the patient, (ii) identifies any topics in the patient's response that warrant follow-up questions, and (iii) identifies any explicit instructions from the patient for modifying the screening or monitoring conversation. Examples of the last include patient statements such as “can you speak louder?”, “can you repeat that?” or “what?”, and “please speak more slowly.” In step 814, generalized dialogue flow logic 602 selects the next question to ask the subject patient, along with other dialogue actions to be performed by I/O logic 604, in the next performance of step 806. In particular, interaction control logic generator 702 (i) receives dialogue state data from history and state machine 720 regarding the question to be asked next, (ii) receives intermediate results data from I/O logic 604 representing evaluation of the patient's health state so far, and (iii) receives patient preferences and pertinent topics.

Processing transfers through next step 816 to loop step 804. Generalized dialogue flow logic 602 repeats the loop of steps 804-816 until interaction control logic generator 702 determines that the screening or monitoring conversation is complete, at which point generalized dialogue flow logic 602 politely terminates the screening or monitoring conversation. The screening or monitoring conversation is complete when (i) all mandatory questions have been asked and answered by the patient and (ii) the measure of confidence in the score resulting from screening or monitoring determined in step 812 is at least a predetermined threshold. It should be noted that confidence in the screening or monitoring is not symmetrical.
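
The following non-limiting Python sketch illustrates this completion condition: the loop continues until all mandatory questions have been asked and the confidence measure reaches a predetermined threshold. The model class and I/O stub are hypothetical stand-ins for runtime model server logic 504 and I/O logic 604, not the disclosed implementations.

    import random

    class ToyModel:
        """Hypothetical stand-in for runtime model server logic 504."""
        def __init__(self):
            self.evidence = 0.0

        def update(self, response):
            # Each response nudges the running score and confidence upward;
            # the arithmetic here is purely illustrative.
            self.evidence += len(response) / 100.0
            score = min(self.evidence, 1.0)
            confidence = min(0.3 + 0.2 * self.evidence, 1.0)
            return score, confidence

        def pick_followup(self):
            return random.choice(["Tell me more about that.",
                                  "How did that make you feel?"])

    def capture_response(question):
        # Hypothetical stand-in for I/O logic 604's audiovisual capture.
        return "(patient's spoken answer to: %s)" % question

    def run_screening(mandatory_questions, model, threshold=0.8):
        """Loop until all mandatory questions are asked AND confidence is high."""
        pending = list(mandatory_questions)
        score = confidence = 0.0
        while pending or confidence < threshold:
            question = pending.pop(0) if pending else model.pick_followup()
            score, confidence = model.update(capture_response(question))
        return score, confidence

    print(run_screening(["Can you hear me?", "How is your sleep?"], ToyModel()))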

The screening or monitoring conversation seeks to detect specific health states in the patient, e.g., depression and anxiety. If such states are detected early, the detection stands. However, the absence of such states is not assured by failing to detect them immediately. More generally, absence of proof is not proof of absence. Thus, generalized dialogue flow logic 602 finds confidence in early detection but not in early failure to detect.

Thus, health screening or monitoring server 102 (FIG. 4) estimates the current health state, e.g., mood, of the patient using a spoken conversation with the patient through patient device 112. Interactive screening or monitoring server logic 502 sends data representing the resulting screening or monitoring of the patient to the patient's doctor or other clinicians by sending the data to clinician device 114. In addition, interactive screening or monitoring server logic 502 records the resulting screening or monitoring in screening or monitoring system data store 410. A top priority of generalized dialogue flow logic 602 is to elicit speech from the patient that is highly informative with respect to the health state attributes for which health screening or monitoring server 102 screens the patient. For example, in this illustrative embodiment, health screening or monitoring server 102 screens most patients for depression and anxiety. The analysis performed by runtime model server logic 504 is most accurate when presented with patient speech of a particular quality. In this context, speech quality refers to the sincerity with which the patient is speaking. Generally speaking, high quality speech is genuine and sincere, while poor quality speech is from a patient not engaged in the conversation or being intentionally dishonest.

For example, if the patient does not care about the accuracy of the screening or monitoring, but instead wants to answer all questions as quickly as possible to end the screening or monitoring as quickly as possible, the conversation is unlikely to reveal much about the patient's true health. Similarly, if the patient intends to control the outcome of the screening or monitoring by giving false responses, not only are the responses linguistically false, but the emotional components of the speech may be distorted or missing due to the disingenuous participation by the patient. There are a number of ways in which generalized dialogue flow logic 602 increases the likelihood that the patient's responses are relatively highly informative. For example, generalized dialogue flow logic 602 may invite the patient to engage interactive screening or monitoring server logic 502 as an audio diary whenever the patient is so inclined. Voluntary speech by the patient, whenever motivated, tends to be genuine and sincere and therefore highly informative.

Generalized dialogue flow logic 602 may also select topics that are pertinent to the patient. These topics may include topics specific to clinical and social records of the patient and topics specific to interests of the patient. Using topics of interest to the patient may have the negative effect of influencing the patient's mood. For example, asking the patient about her favorite sports team may cause the patient's mood to rise or fall with the most recent news of the team. Accordingly, generalized dialogue flow logic 602 distinguishes health-relevant topics of interest to the patient from health-irrelevant topics of interest to the patient. For example, questions related to an estranged relative of the patient may be health-relevant, while questions related to the patient's favorite television series are typically not. Adapting any synthetic voice to match the preferences of the patient makes the screening or monitoring conversation more engaging for the patient and therefore elicits more informative speech. In embodiments in which patient device 112 displays a video representation of a speaker, i.e., an avatar, to the patient, patient preferences include, in addition to the preferred voice, physical attributes of the appearance of the avatar.

When a patient has not specified preferences for a synthetic voice or avatar, generalized dialogue flow logic 602 may use a synthetic voice and avatar chosen for the first screening or monitoring conversation and, in subsequent screening or monitoring conversations, change the synthetic voice and avatar and compare the degree of informativeness of the patient's responses to determine which voice and avatar elicit the most informative responses. The voice and avatar chosen for the initial screening or monitoring conversation may be chosen according to which voice and avatar tend to elicit the most informative speech among the general population or among portions of the general population sharing one or more phenotypes with the patient. The manner in which the informativeness of responses elicited by a question is determined is described below.

To make the screening or monitoring conversation more interactive and engaging, generalized dialogue flow logic 602 inserts a synthetic backchannel into the conversation. For example, generalized dialogue flow logic 602 may utter “uh-huh” during short pauses in the patient's speech to indicate that generalized dialogue flow logic 602 is listening and interested in what the patient has to say. Similarly, generalized dialogue flow logic 602 may cause the video avatar to exhibit non-verbal behavior (sometimes referred to as “body language”) to indicate attentiveness and interest in the patient.

Generalized dialogue flow logic 602 also selects questions that are of high quality. Question quality is measured by the informativeness of the responses elicited by the question. In addition, generalized dialogue flow logic 602 avoids repetition of identical questions in subsequent screening or monitoring conversations, substituting equivalent questions when possible. The manner in which questions are determined to be equivalent to one another is described more completely below. As described above, question and dialogue action bank 710 (FIG. 7) is a data store that stores all dialogue actions that may be taken by interaction control logic generator 702, including all questions that may be asked of the patient.

Question and dialogue action bank 710 is shown in greater detail in FIG. 9 and includes a number of question records 902 and a dialogue 912. Each of question records 902 includes data representing a single question that may be asked of a patient. Dialogue 912 is a series of questions to ask a patient in a spoken conversation with the patient. Each of question records 902 includes a question body 904, a topic 906, a quality 908, and an equivalence 910. Question body 904 includes data specifying the substantive content of the question, i.e., the sequence of words to be spoken to the patient to effect asking of the question. Topic 906 includes data specifying a hierarchical topic category to which the question belongs. Categories may correlate to (i) specific health diagnoses such as depression, anxiety, etc.; (ii) specific symptoms such as insomnia, lethargy, general disinterest, etc.; and/or (iii) aspects of a patient's treatment such as medication, exercise, etc. Quality 908 includes data representing the quality of the question. The quality of the question is a measure of the informativeness of the responses elicited by the question. Equivalence 910 is data identifying one or more other questions in question records 902 that are equivalent to the question represented by this particular one of question records 902. In this illustrative embodiment, only questions of the same topic 906 may be considered equivalent. In an alternative embodiment, any questions may be considered equivalent regardless of topic.

Dialogue 912 includes an ordered sequence of questions 914A-N, each of which identifies a respective one of question records 902 to ask in a spoken conversation with the patient. In this illustrative embodiment, the spoken conversation begins with twenty (20) preselected questions and may include additional questions as necessary to produce a threshold degree of confidence to conclude the conversation of logic flow diagram 800 (FIG. 8). The preselected questions include, in order, five (5) open-ended questions of high quality, the eight (8) questions of the standard and known PHQ-8 screening or monitoring tool for depression, and the seven (7) questions of the standard and known GAD-7 screening or monitoring tool for anxiety. In other examples, the questions may be generated algorithmically. Dialogue 912 specifies these twenty (20) questions in this illustrative embodiment.

As described above, interaction control logic generator 702 determines the next question to ask the patient in step 814. One embodiment of step 814 is shown as logic flow diagram 1014 (FIG. 10). In step 1002, interaction control logic generator 702 dequeues a question from dialogue 912, treating the ordered sequence of questions 914A-N as a queue. History and state machine 720 keeps track of which of questions 914A-N is next. If the screening or monitoring conversation is not complete according to the intermediate score and all of questions 914A-N have been processed in previous performances of step 1002 in the same spoken conversation, i.e., if the question queue is empty, interaction control logic generator 702 selects questions from those of question records 902 with the highest quality 908 and pertaining to topics selected for the patient.

If interaction control logic generator 702 selects multiple questions, interaction control logic generator 702 may select one as the dequeued question randomly, with each question weighted by its quality 908 and its closeness to suggested topics.

In step 1004 (FIG. 10), interaction control logic generator 702 collects all equivalent questions identified by equivalence 910 (FIG. 9) for the question dequeued in step 1002. In step 1006, interaction control logic generator 702 selects a question from the collection of equivalent questions collected in step 1004, including the question dequeued in step 1002 itself. Interaction control logic generator 702 may select one of the equivalent questions randomly or using information about prior interactions with the patient, e.g., to select the one of the equivalent questions least recently asked of the patient. Interaction control logic generator 702 processes the selected question as the next question in the next iteration of the loop of steps 804-816 (FIG. 8). The use of equivalent questions is important. The quality of a question, i.e., the degree to which the responses the question elicits are informative in runtime model server logic 504, decreases for a given patient over time. In other words, if a given question is asked of a given patient repeatedly, each successive response by the patient becomes less informative than it was in all prior askings of the question. In a sense, questions become stale over time. To keep questions fresh, i.e., soliciting consistently informative responses over time, a given question is replaced with an equivalent, but different, question in a subsequent conversation. However, the measurement of equivalence must be accurate for comparisons of responses to equivalent questions over time to be consistent.
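
A minimal sketch of the least-recently-asked substitution strategy follows, in Python; the data structures and identifiers are hypothetical and assume equivalence relationships have already been computed as described below.

    def select_equivalent(question_id, equivalences, last_asked):
        """Replace a stale question with its least recently asked equivalent.

        `equivalences` maps a question id to the ids deemed equivalent to it;
        `last_asked` maps a question id to the session index at which it was
        last used (absent = never asked).
        """
        candidates = [question_id] + equivalences.get(question_id, [])
        # Never-asked questions sort first; then the oldest usage wins.
        return min(candidates, key=lambda q: last_asked.get(q, -1))

    equivalences = {"sleep_v1": ["sleep_v2", "sleep_v3"]}
    last_asked = {"sleep_v1": 5, "sleep_v2": 3}       # sleep_v3 never asked
    print(select_equivalent("sleep_v1", equivalences, last_asked))  # sleep_v3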

Thus, two important concepts of questions in generalized dialogue flow logic 602 (FIG. 7) are question quality and question equivalence. Question quality and question equivalence are managed by question management logic 916, which is shown in greater detail in FIG. 11. Question management logic 916 includes question quality logic 1102, which measures a question's quality, and question equivalence logic 1104, which determines whether two (2) questions are equivalent in the context of health screening or monitoring server 102. Question quality logic 1102 includes a number of metric records 1106 and metric aggregation logic 1112. To measure the quality of a question, i.e., to measure how informative the responses elicited by the question are, question quality logic 1102 uses a number of metrics to be applied to a question, each of which results in a numeric quality score for the question and each of which is represented by one of metric records 1106. Each of metric records 1106 represents a single metric for measuring question quality and includes metric metadata 1108 and quantification logic 1110. Metric metadata 1108 represents information about the metric of metric record 1106. Quantification logic 1110 defines the behavior of question quality logic 1102 in evaluating a question's quality according to the metric of metric record 1106.

The following are examples of metrics that may be applied by question quality logic 1102 to measure the quality of various questions: (i) the length of elicited responses in terms of a number of words; (ii) the length of elicited responses in terms of duration of the responsive utterance; (iii) a weighted word score; (iv) an amount of acoustic energy in elicited responses; and (v) “voice activation” in the responses elicited by the question. Each is described in turn.

In a metric record 1106 representing a metric of the length of elicited responses in terms of a number of words, quantification logic 1110 retrieves all responses to a given question from screening or monitoring system data store 410 (FIG. 4) and uses associated results data from screening or monitoring system data store 410 to determine the number of words in each of the responses. Quantification logic 1110 quantifies the quality of the question as a statistical measure of the number of words in the responses, e.g., a statistical mean thereof.
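
As a non-limiting illustration, this word-count metric reduces to a short computation such as the following Python sketch; the function name and sample responses are hypothetical.

    import statistics

    def word_count_quality(responses):
        """Quality of a question = mean word count of the responses it elicited."""
        return statistics.mean(len(r.split()) for r in responses)

    responses_to_question = [
        "I have not been sleeping well lately and it worries me",
        "fine",
        "Most nights I lie awake for hours thinking about work",
    ]
    print(word_count_quality(responses_to_question))  # mean of 11, 1, 10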

With respect to the length of elicited responses in terms of duration of the responsive utterance, the duration of elicited responses may be measured in a number of ways. In one, the duration of the elicited response is simply the elapsed duration, i.e., the entire duration of the response as recorded in screening or monitoring system data store 410. In another, the duration of the elicited response is the elapsed duration less pauses in speech. In yet another, the duration of the elicited response is the elapsed duration less any pause in speech at the end of the response.

In a metric record 1106 (FIG. 11) representing a metric of the duration of elicited responses, quantification logic 1110 retrieves all responses to a given question from screening or monitoring system data store 410 (FIG. 4) and determines the duration of those responses. Quantification logic 1110 (FIG. 11) quantifies the quality of the question as a statistical measure of the duration of the responses, e.g., a statistical mean thereof.

With respect to a weighted word score, semantic models of NLP model 1806 (FIG. 18) estimate a patient's health state from positive and/or negative content of the patient's speech. The semantic models correlate individual words and phrases to specific health states the semantic models are designed to detect. In a metric record 1106 (FIG. 11) representing a metric of a weighted word score, quantification logic 1110 retrieves all responses to a given question from screening or monitoring system data store 410 (FIG. 4) and uses the semantic models to determine the correlation of each word of each response to one or more health states. An individual response's weighted word score is the statistical mean of the correlations of its constituent words. Quantification logic 1110 quantifies the quality of the question as a statistical measure of the weighted word scores of the responses, e.g., a statistical mean thereof.

With respect to an amount of acoustic energy in elicited responses, runtime model server logic 504 (FIG. 18) estimates a patient's health state from the pitch and energy of the patient's speech as described below. How informative speech is to the various models of runtime model server logic 504 is directly related to how emotional the speech is. In a metric record 1106 (FIG. 11) representing a metric of an amount of acoustic energy, quantification logic 1110 retrieves all responses to a given question from screening or monitoring system data store 410 (FIG. 4) and uses response data from runtime model server logic 504 to determine an amount of energy present in each response. Quantification logic 1110 quantifies the quality of the question as a statistical measure of the measured acoustic energy of the responses, e.g., a statistical mean thereof.

With respect to “voice activation” in the responses elicited by the question, the quality of a question is a measure of how similar responses to the question are to utterances recognized by runtime models 1802 (FIG. 18) as highly indicative of a health state that runtime models 1802 are trained to recognize. In a metric record 1106 (FIG. 11) representing a metric of voice activation, quantification logic 1110 determines how similar deep learning machine features for all responses to a given question are to deep learning machine features for health screening or monitoring server 102 as a whole.

Deep learning machine features are known but are described briefly herein to facilitate understanding and appreciation of the present invention. Deep learning is a sub-science of machine learning in that a deep learning machine is a machine learning machine, i.e., a learning machine, that learns for itself how to distinguish one thing represented in data from another thing represented in data. The following is a simple example to illustrate the distinction.

Consider an ordinary (not deep) learning machine that is configured to recognize the representation of a cat in image data. Such a learning machine is typically a computer process with multiple layers of logic. One layer is manually configured to recognize contiguous portions of an image with transitions from one color to another (e.g., light to dark, red to green, etc.). This is commonly referred to as edge detection. A subsequent layer receives data representing the recognized edges and is manually configured to recognize edges that join together to define shapes. A final layer receives data representing shapes and is manually configured to recognize a symmetrical grouping of triangles (a cat's ears) and dark regions (eyes and nose). Other layers may be used between those mentioned here.

In machine learning, the data received as input to any step in the computation, including intermediate results from other steps in the computation, are called features. The results of the learning machine are called labels. In this illustrative example, the labels are “cat” and “no cat”.

This manually configured learning machine may work reasonably well but may have significant shortcomings. For example, recognizing the symmetrical grouping of shapes might not recognize an image in which a cat is represented in profile. In a deep learning machine, the machine is trained to recognize cats without manually specifying what groups of shapes represent a cat. The deep learning machine may utilize manually configured features to recognize edges, shapes, and groups of shapes; however, these are not a required component of a deep learning system. Features in a deep learning system may be learned entirely automatically by the algorithm based on the labeled training data alone.

Training a deep learning machine to recognize cats in image data can, for example, involve presenting the deep learning machine with numerous, preferably many millions of, images and associated knowledge as to whether each image includes a cat, i.e., associated labels of “cat” or “no cat”. For each image received in training, the last, automatically configured layer of the deep learning machine receives data representing numerous groupings of shapes and the associated label of “cat” or “no cat”. Using statistical analysis and conventional techniques, the deep learning machine determines the statistical weights to be given to each type of shape grouping, i.e., each feature, in determining whether a previously unseen image includes a cat.

These trained, i.e., automatically generated, features of the deep learning machine will likely include the symmetrical grouping of shapes manually configured into the learning machine as described above. However, these features will also likely include shape groupings and combinations of shape groupings not thought of by human programmers.

In measuring the quality of a question, the features of the constituent models of runtime model server logic 504 (FIG. 18) specify precisely the type of responses that indicate a health state that the constituent models of runtime model server logic 504 are configured to recognize. Thus, in evaluating the quality of a question, these features represent an exemplary feature set. To measure the quality of a question using this metric, quantification logic 1110 (FIG. 11) retrieves all responses to the question from screening or monitoring system data store 410, along with data representing the diagnoses associated with those responses, and trains runtime models 1802 and model repository 416 using those responses and associated data.

In training runtime models 1802 and model repository 416, the deep learning machine develops a set of features specific to the question being measured and the determinations to be made by the trained models. Quantification logic 1110 measures the similarity between the feature set specific to the question and the exemplary feature set in a manner described below with respect to question equivalence logic 1104.

As described above, interaction control logic generator 702 (FIG. 7) uses quality 908 (FIG. 9) of various questions in determining which question(s) to ask a particular patient. To provide a comprehensive measure of the quality of a question to store in quality 908 (FIG. 9), metric aggregation logic 1112 (FIG. 11) aggregates the various measures of quality according to metric records 1106. The manner in which metric aggregation logic 1112 aggregates the measures of quality for a given question is illustrated by logic flow diagram 1200 (FIG. 12).

Loop step 1202 and next step 1210 define a loop in which metric aggregation logic 1112 processes each of metric records 1106 according to steps 1204-1208. The particular one of metric records 1106 processed in an iteration of the loop of steps 1202-1210 is sometimes referred to as “the subject metric record”, and the metric represented by the subject metric record is sometimes referred to as “the subject metric.” In step 1204, metric aggregation logic 1112 evaluates the subject metric, using quantification logic 1110 of the subject metric record and all responses in screening or monitoring system data store 410 (FIG. 4) to the subject question. In test step 1206 (FIG. 12), metric aggregation logic 1112 determines whether screening or monitoring system data store 410 includes a statistically significant sample of responses to the subject question by the subject patient. If so, metric aggregation logic 1112 evaluates the subject metric using quantification logic 1110 and only the data corresponding to the subject patient in screening or monitoring system data store 410 in step 1208. Conversely, if screening or monitoring system data store 410 does not include a statistically significant sample of responses to the subject question by the subject patient, metric aggregation logic 1112 skips step 1208. Thus, metric aggregation logic 1112 evaluates the quality of a question in the context of the subject patient to the extent screening or monitoring system data store 410 contains sufficient data corresponding to the subject patient.

After steps 1206-1208, processing transfers through next step 1210 to loop step 1202, and metric aggregation logic 1112 processes the next metric according to the loop of steps 1202-1210. Once all metrics have been processed according to the loop of steps 1202-1210, processing transfers to step 1212, in which metric aggregation logic 1112 aggregates the evaluated metrics from all performances of steps 1204 and 1208 into a single measure of quality and stores data representing that measure of quality in quality 908. In this illustrative embodiment, metric metadata 1108 stores data specifying how metric aggregation logic 1112 is to include the associated metric in the aggregate measure in step 1212. For example, metric metadata 1108 may specify a weight to be attributed to the associated metric relative to other metrics.
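
A non-limiting Python sketch of such weighted aggregation follows; the metric names, scores, and weights are hypothetical, with the weights playing the role of the inclusion rules stored in metric metadata 1108.

    def aggregate_quality(metric_scores, metric_weights):
        """Combine per-metric quality scores into one value (cf. step 1212).

        Both dicts are keyed by metric name; the result is a weighted mean.
        """
        total_weight = sum(metric_weights.values())
        return sum(metric_scores[m] * metric_weights[m]
                   for m in metric_scores) / total_weight

    scores = {"word_count": 0.7, "duration": 0.6, "acoustic_energy": 0.9}
    weights = {"word_count": 2.0, "duration": 1.0, "acoustic_energy": 1.0}
    print(aggregate_quality(scores, weights))  # (1.4 + 0.6 + 0.9) / 4 = 0.725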

After step 1212 (FIG. 12), processing according to logic flow diagram 1200 completes.

As described above, equivalence 910 for a given question identifies one or more other questions in question records 902 that are equivalent to the given question. Whether two questions are equivalent is determined by question equivalence logic 1104 (FIG. 11) by comparing the similarity between the two questions to a predetermined threshold. The similarity here is not how similar the words and phrasing of the sentences are, but instead how similarly the models of runtime model server logic 504 and model repository 416 see them. The predetermined threshold is determined empirically. Question equivalence logic 1104 measures the similarity between two questions in a manner illustrated by logic flow diagram 1300 (FIG. 13).

Loop step 1302 and next step 1306 define a loop in which question equivalence logic 1104 processes each of metric records 1106 according to step 1304. The particular one of metric records 1106 processed in an iteration of the loop of steps 1302-1306 is sometimes referred to as "the subject metric record", and the metric represented by the subject metric record is sometimes referred to as "the subject metric." In step 1304, question equivalence logic 1104 evaluates the subject metric for each of the two questions. Once all metrics have been processed according to the loop of steps 1302-1306, processing by question equivalence logic 1104 transfers to step 1308.

In step 1308, question equivalence logic 1104 combines the evaluated metrics into a respective multi-dimensional vector for each question.

In step 1310, question equivalence logic 1104 normalizes both vectors to have a length of 1.0. In step 1312, question equivalence logic 1104 determines an angle between the two normalized vectors.

In step 1314, the cosine of the angle determined in step 1312 is determined by question equivalence logic 1104 to be the measured similarity between the two questions.

Since the vectors are normalized to a length of 1.0, the similarity between two questions ranges from −1.0 to 1.0, with 1.0 being perfectly equivalent. In this illustrative embodiment, the predetermined threshold is 0.98, such that two questions having a measured similarity of at least 0.98 are considered equivalent and are so represented in equivalence 910 (FIG. 9) for both questions.

In addition, since the comparison between questions is not a comparison of a single value but instead a comparison of multi-dimensional vectors, two questions are equivalent not merely if they are similar in general, but only if they are similar in most or every way measured.
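
The computation of steps 1308-1314 is the standard cosine similarity between the two questions' metric vectors. A minimal sketch, assuming the evaluated metrics for each question have already been collected into equal-length lists:

```python
import math

def question_similarity(metrics_a, metrics_b):
    """Cosine similarity between two questions' metric vectors (steps 1308-1314).

    Each argument is a list of evaluated metric values for one question.
    Because both vectors are normalized to length 1.0, the dot product equals
    the cosine of the angle between them, ranging from -1.0 to 1.0.
    """
    norm_a = math.sqrt(sum(x * x for x in metrics_a))
    norm_b = math.sqrt(sum(x * x for x in metrics_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    dot = sum(a * b for a, b in zip(metrics_a, metrics_b))
    return dot / (norm_a * norm_b)

EQUIVALENCE_THRESHOLD = 0.98  # empirically determined, per the text
equivalent = question_similarity([0.8, 0.1, 0.5], [0.79, 0.12, 0.5]) >= EQUIVALENCE_THRESHOLD
```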

In another embodiment (FIG. 3), assessment test administrator 2202 (FIG. 22) administers a depression assessment test to the subject patient by conducting an interactive spoken conversation with the subject patient through patient device 312. The manner in which assessment test administrator 2202 does so is illustrated in logic flow diagram 1400 (FIG. 14). The test administrator 2202 may be a computer program configured to pose questions to the patient. The questions may be algorithmically generated questions. The questions may be generated by, for example, a natural language processing (NLP) algorithm. Examples of NLP algorithms are semantic parsing, sentiment analysis, vector-space semantics, and relation extraction. In some embodiments, the methods described herein may be able to generate an assessment without requiring the presence or intervention of a human clinician. In other embodiments, the methods described herein may be used to augment or enhance clinician-provided assessments, or to aid a clinician in providing an assessment. The assessment may include queries containing subject matter that has been adapted or modified from screening or monitoring methods, such as the PHQ-9 and GAD-7 assessments. The assessment herein may not merely use the questions from such surveys verbatim, but may adaptively modify the queries based at least in part on responses from subject patients.

In step 1402, assessment test administrator 2202 optimizes the testing environment. Step 1402 is shown in greater detail in logic flow diagram 1402 (FIG. 15).

In step 1502, assessment test administrator 2202 initiates the spoken conversation with the subject patient. In this illustrative embodiment, assessment test administrator 2202 initiates the conversation by asking the patient the initial question of the assessment test. The initial question is selected in a manner described more completely below. The exact question asked isn't particularly important. What is important is that the patient responds with enough speech that assessment test administrator 2202 may evaluate the quality of the video and audio signal received from patient device 312.

Assessment test administrator 2202 receives and processes audiovisual data from patient device 312 throughout the conversation. Loop step 1504 and next step 1510 define a loop in which assessment test administrator 2202 processes the audiovisual signal according to steps 1506-1508 until assessment test administrator 2202 determines that the audiovisual signal is of high quality, or at least of adequate quality to provide an accurate assessment.

In step 1506, assessment test administrator 2202 evaluates the quality of the audiovisual signal received from patient device 312. In particular, assessment test administrator 2202 measures the volume of the speech, the clarity of the speech, and to what degree the patient's face and, when available, body are visible.

In step 1508, assessment test administrator 2202 reports the evaluation to the patient. In particular, assessment test administrator 2202 generates an audiovisual signal that represents a message to be played to the patient through patient device 312. If the audiovisual signal received from patient device 312 is determined by assessment test administrator 2202 to be of inadequate quality, the message asks the patient to adjust her environment to improve the signal quality. For example, if the audio portion of the signal is poor, the message may be "I'm having trouble hearing you. Can you move the microphone closer to you or find a quieter place?" If the patient's face and, when available, body are not clearly visible, the message may be "I can't see your face (and body). Can you reposition your phone so I can see you?" After step 1508, processing by assessment test administrator 2202 transfers through next step 1510 to loop step 1504, and assessment test administrator 2202 continues processing according to the loop of steps 1504-1510 until the received audiovisual signal is adequate or is determined to be as good as it will get for the current assessment test. It is preferred that subsequent performances of step 1508 be responsive to any speech by the patient. For example, the patient may attempt to comply with a message to improve the environment with the question, "Is this better?" The next message sent in the reporting of step 1508 should include an answer to the patient's question. As described herein, composite model 2204 includes a language model component, so assessment test administrator 2202 necessarily performs speech recognition.
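
The environment-optimization loop of steps 1504-1510 can be sketched as follows. The `session` interface and all threshold values are hypothetical placeholders, not components named by this disclosure:

```python
def optimize_environment(session, max_attempts=5):
    """Loop of steps 1504-1510 (sketch): evaluate signal quality and coach
    the patient until the signal is adequate or attempts run out.

    `session` is a hypothetical interface assumed to expose:
      - signal_quality(): dict with 'volume', 'clarity', 'visibility' in [0, 1]
      - play_prompt(text): play a spoken message to the patient
    All threshold values are illustrative.
    """
    for _ in range(max_attempts):
        q = session.signal_quality()
        if q["volume"] > 0.5 and q["clarity"] > 0.6 and q["visibility"] > 0.7:
            return True  # adequate quality; step 1402 completes

        # Step 1508: report the evaluation and ask for an adjustment.
        if q["volume"] <= 0.5 or q["clarity"] <= 0.6:
            session.play_prompt("I'm having trouble hearing you. Can you move "
                                "the microphone closer or find a quieter place?")
        if q["visibility"] <= 0.7:
            session.play_prompt("I can't see your face. Can you reposition "
                                "your phone so I can see you?")
    return False  # proceed with the best signal obtainable
```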

When the received audiovisual signal is adequate or is determined to be as good as it will get for the current assessment test, processing by assessment test administrator 2202 according to the loop of steps 1504-1510 completes. In addition, processing according to logic flow diagram 1402, and therefore step 1402 (FIG. 14), completes.

Loop step 1404 and next step 1416 define a loop in which assessment test administrator 2202 conducts the spoken conversation of the assessment test according to steps 1406-1414 until assessment test administrator 2202 determines that the assessment test is completed.

In step 1406, assessment test administrator 2202 asks a question of the patient in furtherance of the spoken conversation. In this illustrative embodiment, assessment test administrator 2202 uses a queue of questions to ask the patient, and that queue is sometimes referred to herein as the conversation queue. In the first performance of step 1406, the queue may be prepopulated with questions to be covered during the assessment test. In general, these questions cover the same general subject matter covered by currently used written assessment tests such as the PHQ-9 and GAD-7. However, while the questions in those tests are intentionally designed to elicit extremely short and direct answers, assessment test administrator 2202 may require more audio and video than is provided by one-word answers. Accordingly, it is preferred that the initially queued questions be more open-ended.

In this illustrative embodiment, the initial questions pertain to the topics of general mood, sleep, and appetite. An example of an initial question pertaining to sleep is question 1702 (FIG. 17): "How have you been sleeping recently?" This question is intended to elicit a sentence or two from the patient to thereby provide more audio and video of the patient than would ordinarily be elicited by a highly directed question.

In step 1408 (FIG. 14), assessment test administrator 2202 receives an audiovisual signal of the patient's response to the question. While processing according to logic flow diagram 1400 is shown in a manner that suggests synchronous processing, assessment test administrator 2202 performs step 1408 effectively continuously during performance of steps 1402-1416 and processes the conversation asynchronously. The same is true for steps 1410-1414.

In step 1410, assessment test administrator 2202 processes the audiovisual signal received in step 1408 using composite model 2204. In step 1412, assessment test administrator 2202 produces an intermediate score for the assessment test according to the audiovisual signal received so far.

In step 1414, assessment test administrator 2202 selects the next question to ask the subject patient in the next performance of step 1406, and processing transfers through next step 1416 to loop step 1404. Step 1414 is shown in greater detail as logic flow diagram 1414 (FIG. 16). In addition, FIG. 16 may be construed to follow from step 814 of FIG. 8.

In step 1602, assessment test administrator 2202 identifies significant elements in the patient's speech. In particular, assessment test administrator 2202 uses language portions of composite model 2204 to identify distinct assertions in the portion of the audiovisual signal received after the last question asked in step 1406 (FIG. 14). That portion of the audiovisual signal is sometimes referred to herein as "the patient's response" in the context of a particular iteration of the loop of steps 1604-1610.

An example of a conversation conducted by assessment test administrator 2202 of real-time system 302 and patient device 312 is shown in FIG. 17. It should be appreciated that conversation 1700 is illustrative only. The particular questions to ask, which parts of the patient's response are significant, and the depth to which any topic is followed are determined by the type of information to be gathered by assessment test administrator 2202 and are configured therein. In step 1702, assessment test administrator 2202 asks the question, "How have you been sleeping recently?" The patient's response is "Okay . . . I've been having trouble sleeping lately. I have meds for that. They seem to help." In step 1602, assessment test administrator 2202 identifies three (3) significant elements in the patient's response: (i) "trouble sleeping" suggests that the patient has some form of insomnia or at least that sleep is poor; (ii) "I have meds" suggests that the patient is taking medication; and (iii) "They seem to help" suggests that the medication taken by the patient is effective. In the illustrative example of conversation 1700, each of these significant elements is processed by assessment test administrator 2202 in the loop of steps 1604-1610.

Loop step 1604 and next step 1610 define a loop in which assessment test administrator 2202 processes each significant element of the patient's answer identified in step 1602 according to steps 1606-1608. In the context of a given iteration of the loop of steps 1604-1610, the particular significant element processed is sometimes referred to as "the subject element." In step 1606, assessment test administrator 2202 processes the subject element, recording details included in the element and identifying follow-up questions. For example, in conversation 1700 (FIG. 17), assessment test administrator 2202 identifies three (3) topics for follow-up questions for the element of insomnia: (i) the type of insomnia (initial, middle, or late), (ii) the frequency of insomnia experienced by the patient, and (iii) what medication, if any, the patient is taking for the insomnia.

In step 1608, assessment test administrator 2202 enqueues any follow-up questions identified in step 1606.

After step 1608, processing by assessment test administrator 2202 transfers through next step 1610 to loop step 1604 until assessment test administrator 2202 has processed all significant elements of the patient's response according to the loop of steps 1604-1610. Once assessment test administrator 2202 has processed all significant elements of the patient's response according to the loop of steps 1604-1610, processing transfers from loop step 1604 to step 1612.

In the illustrative context of conversation 1700 (FIG. 17), the state of the conversation queue is as follows. FIG. 17 shows a particular instantiation of a conversation proceeding between the system and a patient. The queries and replies disclosed herein are exemplary and should not be construed as being required to follow the sequence disclosed in FIG. 17. In processing the response element of insomnia, assessment test administrator 2202 identifies and enqueues follow-up topics regarding the type of insomnia and any medication taken for the insomnia.

In processing the response element of medication in the patient's response, assessment test administrator 2202 observes that the patient is taking medication. In step 1606, assessment test administrator 2202 records that fact and, identifying a queued follow-up question regarding medication for insomnia, processes the medication element as responsive to the queued question.

In step 1608 for the medication element, assessment test administrator 2202 enqueues follow-up questions regarding the particular medicine and dosage used by the patient and its efficacy, as shown in step 1708.

In this illustrative embodiment, questions in the conversation queue are hierarchical. In the hierarchy, each follow-up question is a child of the question for which the follow-up question follows up; the latter question is the parent of the follow-up question. In dequeuing questions from the conversation queue, assessment test administrator 2202 implements a pre-order, depth-first walk of the conversation queue hierarchy. In other words, all child questions of a given question are processed before the next sibling question is processed. In conversational terms, all follow-up questions of a given question are processed, recursively, before the next question at the same level is processed. In the context of conversation 1700, assessment test administrator 2202 processes all follow-up questions of the type of insomnia before processing the questions of frequency and medication and any of their follow-up questions. This is the way conversations happen naturally: staying with the most recently discussed topic until it is complete before returning to a previously discussed topic.
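
A minimal sketch of the hierarchical conversation queue and its pre-order, depth-first dequeuing follows. The node structure is an assumption; any representation that visits all follow-up (child) questions before the next sibling behaves equivalently:

```python
class QuestionNode:
    """One question in the conversation queue; children are its follow-ups."""
    def __init__(self, text):
        self.text = text
        self.children = []  # follow-up questions, in intended order

    def add_follow_up(self, text):
        node = QuestionNode(text)
        self.children.append(node)
        return node

def dequeue_preorder(roots):
    """Yield questions in pre-order, depth-first order: all follow-ups of a
    question are asked before its next sibling, mirroring how a natural
    conversation stays on the current topic until it is exhausted."""
    stack = list(reversed(roots))
    while stack:
        node = stack.pop()
        yield node.text
        stack.extend(reversed(node.children))

# Example mirroring conversation 1700: an initial sleep question with
# follow-ups on the type of insomnia and on medication.
sleep = QuestionNode("How have you been sleeping recently?")
sleep.add_follow_up("Have you been waking up in the middle of the night?")
sleep.add_follow_up("Which medication are you taking for your insomnia?")
for question in dequeue_preorder([sleep]):
    print(question)
```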

In addition, the order in which sibling questions are processed by assessment test administrator 2202 may be influenced by the responses of the patient. In this illustrative example, a follow-up question regarding the frequency of insomnia precedes the follow-up question regarding medication. However, when processing the element regarding medication in step 1606, assessment test administrator 2202 changes the sequence of follow-up questions such that the follow-up question regarding medication is processed prior to processing the follow-up question regarding insomnia frequency. Since medication was mentioned by the patient, it is discussed before new subtopics are added to the conversation. This is another way in which assessment test administrator 2202 is responsive to the patient.

In processing the response element of medication efficacy (i.e., "They seem to help."), assessment test administrator 2202 records that the medication is moderately effective. Seeing that the conversation queue includes a question regarding the efficacy of medication, assessment test administrator 2202 applies this portion of the patient's response as responsive to the queued follow-up question in step 1710.

In step 1612, assessment test administrator 2202 dequeues the next question from the conversation queue, processing according to logic flow diagram 1414, and therefore step 1414, completes, and the conversation continues. Prior to returning to the discussion of FIG. 14, it is helpful to consider additional performances of step 1414, and therefore logic flow diagram 1414, in the context of illustrative conversation 1700. The question dequeued as the next question in this illustrative embodiment asks about the patient's insomnia, trying to discern the type of insomnia. It is appreciated that conventional thinking, as reflected in the PHQ-9 and GAD-7, is that the particular type of sleep difficulties experienced by a test subject isn't as strong an indicator of depression as the mere fact that sleep is difficult. However, delving more deeply into a topic of conversation has a number of beneficial consequences. Most significant is that the patient is encouraged to provide more speech for more accurate assessment of the patient's state. In addition, asking questions about something the patient has just said suggests that assessment test administrator 2202 is interested in the patient personally and, by earning good will from the patient, makes the patient more likely to be honest, both in speech and behavior.

In the illustrative example of conversation 1700, the next question is related to the type of insomnia. The question is intentionally as open-ended as possible while still targeted at specific information: "Have you been waking up in the middle of the night?" See question 1712. While this question may elicit a "Yes" or "No" answer, it may also elicit a longer response, such as response 1714: "No. I just have trouble falling asleep." After step 1612, processing according to logic flow diagram 1414, and therefore step 1414 (FIG. 14), completes. In successive iterations of the loop of steps 1404-1416, assessment test administrator 2202 continues the illustrative example of conversation 1700. In the next performance of step 1406, assessment test administrator 2202 asks question 1712 (FIG. 17). In the next continuing performance of step 1408, assessment test administrator 2202 receives response 1714. Assessment test administrator 2202 processes response 1714 in the next performance of step 1414.

In this illustrative performance of step 1602 (FIG. 16), assessment test administrator 2202 identifies a single significant element, namely, that the patient has trouble falling asleep and doesn't wake in the middle of the night. In step 1606, assessment test administrator 2202 records the type of insomnia (see step 1716) and, in this illustrative embodiment, there are no follow-up questions related to that.

In this illustrative performance of step 1612, assessment test administrator 2202 dequeues the next question from the conversation queue. Since there are no follow-up questions for the type of insomnia, and the question of whether the patient is treating the insomnia with medication has already been answered, the next question is the first child question related to medication, namely, the particular medication taken by the patient.

In the next iterative performance of step 1406 (FIG. 14), assessment test administrator 2202 forms the question, namely, which particular medication the patient is taking for insomnia. In some embodiments, assessment test administrator 2202 asks that question in the most straightforward way, e.g., "You said you're taking medication for your insomnia. Which drug are you taking?" This has the advantage of being open-ended and eliciting more speech than would a simple yes/no question.

In other embodiments, assessment test administrator 2202 accesses clinical data related to the patient to help identify the particular drug used by the patient. The clinical data may be received from modeling system 304 (FIG. 22), using clinical data 2220, or from clinical data server 306 (FIG. 3). Accordingly, assessment test administrator 2202 may ask a more directed question using the assumed drug's most common name and generic name. For example, if the patient's data indicates that the patient has been prescribed Zolpidem (the generic name of the drug sold under the brand name Ambien), question 1720 (FIG. 17) may be, "You said you're taking medication for insomnia. Is that Ambien or Zolpidem?" This highly directed question risks eliciting no more than a simple yes/no response (e.g., response 1722). However, this question also shows a knowledge of, and interest in, the patient, further garnering goodwill and increasing the likelihood of honest responses by the patient and a willingness to continue the assessment test longer.

In this illustrative embodiment, assessment test administrator 2202 determines whether to ask a highly directed question rather than a more open-ended question based on whether the requisite clinical data for the patient is available and on to what degree additional speech is needed to achieve an adequate degree of accuracy in assessing the state of the patient.

The illustrative example of conversation 1700 continues with assessment test administrator 2202 recording the substance of response 1722 in step 1724.

In this illustrative embodiment, assessment test administrator 2202 is also responsive to the patient in the manner in which assessment test administrator 2202 determines whether the patient has completed her response to the most recently asked question, e.g., in determining when an answer received in step 1408 is complete and selection of the next question in step 1414 may begin.

To further develop good will in the patient, assessment test administrator 2202 avoids interrupting the patient as much as possible. It is helpful to consider response 1704: "Okay . . . I've been having trouble sleeping lately. I have meds for that. They seem to help." The ellipsis after "Okay" indicates a pause in replying by the patient. To this end, assessment test administrator 2202 waits long enough to permit the patient to pause briefly without interruption, but not so long as to cause the patient to believe that assessment test administrator 2202 has become unresponsive, e.g., due to a failure of assessment test administrator 2202 or the communications links therewith. Moreover, pauses in speech are used in assessment as described more completely below, and assessment test administrator 2202 should avoid interfering with the patient's speech fluency.

In this illustrative embodiment, assessment test administrator 2202 uses two pause durations, a short one and a long one. After a pause for the short duration, assessment test administrator 2202 indicates that assessment test administrator 2202 continues to listen by playing a very brief sound that acknowledges an understanding and a continuation of listening, e.g., "uh-huh" or "mmm-hmmm". After playing the message, assessment test administrator 2202 waits during any continued pause for the long duration. If the pause continues that long, assessment test administrator 2202 determines that the patient has completed her response.

The particular respective lengths of the short and long durations may be determined empirically. In addition, the optimum lengths may vary from patient to patient. Accordingly, assessment test administrator 2202 continues to adjust these durations for the patient whenever interacting with the patient. Assessment test administrator 2202 recognizes durations that are too short when observing cross-talk, i.e., when speech is being received from the patient while assessment test administrator 2202 concurrently plays any sound. Assessment test administrator 2202 recognizes durations that are too long when (i) the patient explicitly indicates so (e.g., saying "Hello?" or "Are you still there?") and/or (ii) the patient's response indicates increased frustration or agitation relative to the patient's speech earlier in the same conversation.
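
The two-duration turn-taking policy, including its per-patient adjustment, can be sketched as a small state machine. The concrete durations, the adjustment factors, and the `session` interface below are illustrative assumptions; the disclosure specifies only that the durations are tuned empirically and per patient:

```python
import time

class PauseTracker:
    """Two-threshold endpointing (illustrative values, adjusted per patient)."""

    def __init__(self, short_s=0.8, long_s=2.5):
        self.short_s = short_s  # after this pause, play an acknowledgment
        self.long_s = long_s    # after this additional pause, turn is over

    def on_cross_talk(self):
        # Patient spoke while the system played sound: durations too short.
        self.short_s *= 1.2
        self.long_s *= 1.2

    def on_patient_checks_presence(self):
        # "Are you still there?" or rising frustration: durations too long.
        self.short_s *= 0.8
        self.long_s *= 0.8

    def wait_for_turn_end(self, session):
        """Return True once the patient's response is judged complete.

        `session` is a hypothetical interface assumed to expose
        silence_duration(), speech_detected(), and play_prompt().
        """
        if session.silence_duration() < self.short_s:
            return False                # keep listening
        session.play_prompt("mmm-hmm")  # acknowledge, signal continued listening
        start = time.monotonic()
        while time.monotonic() - start < self.long_s:
            if session.speech_detected():
                return False            # patient resumed speaking
            time.sleep(0.05)
        return True                     # long pause elapsed: response complete
```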

The conversation is terminated politely by assessment test administrator 2202 when the assessment test is complete. The assessment test is complete when (i) the initial questions in the conversation queue and all of their descendant questions have been answered by the patient or (ii) the measure of confidence in the score resulting from the assessment determined in step 1412 is at least a predetermined threshold. It should be noted that confidence in the assessment is not symmetrical. The assessment test seeks depression, or other behavioral health conditions, in the patient. If a condition is found quickly, it is found; however, its absence is not assured by a failure to find it immediately. Thus, assessment test administrator 2202 finds confidence in early detection but not in an early failure to detect.
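
The asymmetric termination rule can be expressed compactly: an early positive detection is trusted, but an early failure to detect is not. A sketch, assuming the intermediate score of step 1412 is accompanied by a confidence value:

```python
def assessment_complete(queue_empty, detected, confidence,
                        positive_threshold=0.9):
    """Terminate when the question hierarchy is exhausted, or when a
    positive finding reaches high confidence. Confidence in a negative
    finding never ends the test early, because the absence of depression
    is not assured by failing to find it immediately.
    The threshold value is illustrative."""
    if queue_empty:
        return True
    return detected and confidence >= positive_threshold
```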

Thus, real-time system 302 (FIG. 22) assesses the current mental state of the patient using an interactive spoken conversation with the patient through patient device 312. Assessment test administrator 2202 sends data representing the resulting assessment of the patient to the patient's doctor or other clinician by sending the data to clinician device 314. In addition, assessment test administrator 2202 records the resulting assessment in clinical data 2220.

While assessment test administrator 2202 is described as conducting an interactive spoken conversation with the patient to assess the mental state of the patient, in other embodiments, assessment test administrator 2202 passively listens to the patient speaking with the clinician and assesses the patient's speech in the manner described herein. The clinician may be a mental health professional, a general practitioner, or a specialist such as a dentist, cardiac surgeon, or an ophthalmologist. In one embodiment, assessment test administrator 2202 passively listens to the conversation between the patient and clinician through patient device 312 upon determining that the patient is in conversation with the clinician, e.g., by a "START" control on the clinician's iPad. Upon determining that the conversation between the patient and clinician is completed, e.g., by a "STOP" control on the clinician's iPad, assessment test administrator 2202 ceases passively listening and assessing speech in the manner described above. In addition, since patient device 312 is listening passively and not prompting the patient, assessment test administrator 2202 makes no attempt to optimize the audiovisual signal received through patient device 312 and makes no assumption that faces in any received video signal are those of the patient.

In some embodiments, at the start of the conversation between the patient and the clinician, the clinician asks the patient to initiate listening by assessment test administrator 2202, and the patient does so by issuing a command through patient device 312 that directs assessment test administrator 2202 to begin listening. Similarly, at the end of the conversation, the clinician asks the patient to terminate listening by assessment test administrator 2202, and the patient does so by issuing a command through patient device 312 that directs assessment test administrator 2202 to cease listening.

In alternative embodiments, assessment test administrator 2202 listens to the conversation between the patient and the clinician through clinician device 314. The clinician may manually start and stop listening by assessment test administrator 2202 through clinician device 314 using conventional user-interface techniques.

During the conversation passively heard by assessment test administrator 2202, assessment test administrator 2202 assesses the patient's speech and not the clinician's speech. Assessment test administrator 2202 may distinguish the voices in any of a number of ways, e.g., by a "MUTE" control on the clinician's iPad. In embodiments in which assessment test administrator 2202 listens through patient device 312, assessment test administrator 2202 uses acoustic models (e.g., acoustic models 2218) to distinguish the two voices. Assessment test administrator 2202 identifies the louder voice as that of the patient, assuming patient device 312 is closer to the patient than to the clinician. This may also be the case in embodiments in which clinician device 314 is set up to hear the patient more loudly. For example, clinician device 314 may be configured to listen through a highly directional microphone that the clinician directs toward the patient such that any captured audio signal represents the patient's voice much more loudly than other, ambient sounds such as the clinician's voice. Assessment test administrator 2202 may further distinguish the patient's voice from the clinician's voice using language models 2214, particularly semantic pattern models such as semantic pattern modules 4004, to identify which of the two distinguished voices more frequently asks questions. Assessment test administrator 2202 may further distinguish the patient's voice from the clinician's voice using acoustic models 2218, which may identify and segment out the clinician's voice based on an acoustic analysis of the clinician's voice performed prior to the clinical encounter.

Throughout the conversation between the patient and the clinician, assessment test administrator 2202 assesses the mental state of the patient from the patient's speech in the manner described herein and finalizes the assessment upon detecting the conclusion of the conversation.

Runtime model server logic 504, shown in greater detail in FIG. 18, processes audiovisual signals representing the patient's responses in the interactive screening or monitoring conversation and, while the conversation is ongoing, estimates the current health of the patient from the audiovisual signals.

Automatic speech recognition (ASR) logic 1804 is logic that processes speech represented in the audiovisual data from I/O logic 604 (FIG. 6) to identify words spoken in the audiovisual signal. The results of ASR logic 1804 (FIG. 18) are sent to runtime models 1802.

Runtime models 1802 also receive the audiovisual signals directly from I/O logic 604. In a manner described more completely below, runtime models 1802 combine language, acoustic, and visual models to produce results 1820 from the received audiovisual signal. In turn, interactive screening or monitoring server logic 502 uses results 1820 in real time, as described above, to estimate the current state of the patient and to accordingly make the spoken conversation responsive to the patient as described above.

In addition to identifying words in the audiovisual signal, ASR logic 1804 also identifies where in the audiovisual signal each word appears and a degree of confidence in the accuracy of each identified word in this illustrative embodiment. ASR logic 1804 may also identify non-verbal content of the audiovisual signals, such as laughter and fillers, for example, along with location and confidence information. ASR logic 1804 makes such information available to runtime models 1802.

Runtime models 1802 include descriptive model and analytics 1812, natural language processing (NLP) model 1806, acoustic model 1808, and visual model 1810.

NLP model 1806 includes a number of text-based machine learning models to (i) predict depression, anxiety, and perhaps other health states directly from the words spoken by the patient and (ii) model factors that correlate with such health states. Examples of machine learning that models health states directly include sentiment analysis, semantic analysis, language modeling, word/document embeddings and clustering, topic modeling, discourse analysis, syntactic analysis, and dialogue analysis. Models need not be constrained to one type of information; a model may contain information, for example, from both sentiment- and topic-based features. NLP information includes the score output of specific modules, for example the score from a sentiment detector trained for sentiment rather than for mental health state. NLP information also includes information obtained via transfer-learning-based systems.

NLP model 1806 stores text metadata and modeling dynamics and shares that data with acoustic model 1808, visual model 1810, and descriptive model and analytics 1812. Text data may be received directly from ASR logic 1804 as described above or may be received as text data from NLP model 1806. Text metadata may include, for example, data identifying, for each word or phrase, parts of speech (syntactic analysis), sentiment analysis, semantic analysis, topic analysis, etc. Modeling dynamics includes data representing components of constituent models of NLP model 1806. Such components include machine learning features of NLP model 1806 and other components such as long short-term memory (LSTM) units, gated recurrent units (GRUs), hidden Markov models (HMMs), and sequence-to-sequence (seq2seq) translation information. NLP metadata allows acoustic model 1808, visual model 1810, and descriptive model and analytics 1812 to correlate syntactic, sentiment, semantic, and topic information to corresponding portions of the audiovisual signal. Accordingly, acoustic model 1808, visual model 1810, and descriptive model and analytics 1812 may more accurately model the audiovisual signal.

Runtime models 1802 include acoustic model 1808, which analyzes the audio portion of the audiovisual signal to find patterns associated with various health states, e.g., depression. Associations between acoustic patterns in speech and health are in some cases applicable to different languages without retraining; the models may also be retrained on data from a particular language. In other words, acoustic patterns may be assessed independent of the particular language spoken. Accordingly, acoustic model 1808 analyzes the audiovisual signal in a language-agnostic fashion. In this illustrative embodiment, acoustic model 1808 uses machine learning approaches such as convolutional neural networks (CNNs), long short-term memory (LSTM) units, hidden Markov models (HMMs), etc. for learning high-level representations and for modeling the temporal dynamics of the audiovisual signals.

Acoustic model 1808 stores data representing attributes of the audiovisual signal and machine learning features of acoustic model 1808 as acoustic model metadata and shares that data with NLP model 1806, visual model 1810, and descriptive model and analytics 1812. The acoustic model metadata may include, for example, data representing a spectrogram of the audiovisual signal of the patient's response. In addition, the acoustic model metadata may include both basic features and high-level feature representations of machine learning features. More basic features may include Mel-frequency cepstral coefficients (MFCCs) and various log filter banks, for example, of acoustic model 1808. High-level feature representations may include, for example, convolutional neural networks (CNNs), autoencoders, variational autoencoders, deep neural networks, and support vector machines of acoustic model 1808. The acoustic model metadata allows NLP model 1806 to, for example, use acoustic analysis of the audiovisual signal to improve sentiment analysis of words and phrases. The acoustic model metadata allows visual model 1810 and descriptive model and analytics 1812 to, for example, use acoustic analysis of the audiovisual signal to more accurately model the audiovisual signal.
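
As a concrete example of the more basic acoustic features mentioned above, MFCCs can be extracted from the audio track with an off-the-shelf library. The snippet below uses librosa, which is one possible choice rather than a component of this disclosure, and the file name is a placeholder:

```python
import librosa
import numpy as np

# Load the audio track of the patient's response (placeholder file name).
signal, sample_rate = librosa.load("patient_response.wav", sr=16000)

# 13 Mel-frequency cepstral coefficients per frame, a common basic acoustic
# feature; log Mel filter-bank energies are a closely related alternative.
mfccs = librosa.feature.mfcc(y=signal, sr=sample_rate, n_mfcc=13)

# Summary statistics over time are one simple way to hand the features
# to a downstream classifier.
features = np.concatenate([mfccs.mean(axis=1), mfccs.std(axis=1)])
```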

Runtime model server logic 504 (FIG. 18) includes visual model 1810, which infers various health states of the patient from face, gaze, and pose behaviors. Visual model 1810 may include facial cue modeling, eye/gaze modeling, pose tracking and modeling, etc. These are merely examples.

Visual model 1810 stores data representing attributes of the audiovisual signal and machine learning features of visual model 1810 as visual model metadata and shares that data with NLP model 1806, acoustic model 1808, and descriptive model and analytics 1812. For example, the visual model metadata may include data representing face locations, pose tracking information, and gaze tracking information of the audiovisual signal of the patient's response. In addition, the visual model metadata may include both basic features and high-level feature representations of machine learning features. More basic features may include image processing features of visual model 1810. High-level feature representations may include, for example, CNNs, autoencoders, variational autoencoders, deep neural networks, and support vector machines of visual model 1810. The visual model metadata allows descriptive model and analytics 1812 to, for example, use video analysis of the audiovisual signal to improve sentiment analysis of words and phrases. Descriptive model and analytics 1812 may even use the visual model metadata in combination with the acoustic model metadata to estimate the veracity of the patient in speaking words and phrases for more accurate sentiment analysis. The visual model metadata allows acoustic model 1808 to, for example, use video analysis of the audiovisual signal to better interpret acoustic signals associated with various gazes, poses, and gestures represented in the video portion of the audiovisual signal.

Descriptive features or descriptive analytics are interpretable descriptions that may be computed based on features in the speech, language, video, and metadata, and that convey information about a speaker's speech patterns in a way that a stakeholder may understand. For example, descriptive features may include a speaker sounding nervous or anxious, having a shrill or deep voice, or speaking quickly or slowly. Humans can interpret "features" of voices, such as pitch, rate of speaking, and semantics, in order to mentally determine emotions. A descriptive analytics module, by applying interpretable labels to speech utterances based on their features, differs from a machine learning module. Machine learning models also make predictions by analyzing features, but the methods by which machine learning algorithms process the features, and determine representations of those features, differ from how humans interpret them. Thus, labels that machine learning algorithms may "apply" to data, in the context of analyzing features, may not be labels that humans are able to interpret.

Descriptive model and analytics 1812 (FIG. 18) may generate analytics and labels for numerous health states, not just depression. Examples of such labels include emotion, anxiety, how engaged the patient is, patient energy, sentiment, speech rate, and dialogue topics. In addition, descriptive model and analytics 1812 applies these labels to each word of the patient's response and determines how significant each word is in the patient's response. While the significance of any given word in a spoken response may be inferred from its part of speech, e.g., articles and filler words being relatively insignificant, descriptive model and analytics 1812 infers a word's significance from additional qualities of the word, such as emotion in the manner in which the word is spoken, as indicated by acoustic model 1808.

Descriptive model and analytics 1812 also analyzes trends over time and uses such trends, at least in part, to normalize analysis of the patient's responses. For example, a given patient might typically speak with less energy than others. Normalizing analysis for this patient might set a lower level of energy as "normal" than would be used for the general population. In addition, a given patient may use certain words more frequently than the general population, and use of such words by this patient might not be as notable as such use would be by a different patient. Descriptive model and analytics 1812 may analyze trends in real time, i.e., while a screening or monitoring conversation is ongoing, and in non-real-time contexts.
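
One simple realization of such normalization is a per-patient z-score against the patient's own historical baseline, falling back to population statistics when history is thin. This is a sketch of one plausible approach, not the disclosed implementation:

```python
import statistics

def normalized_energy(current, patient_history, population_mean,
                      population_stdev, min_history=10):
    """Express a speech-energy measurement relative to the patient's own
    baseline when enough history exists, otherwise relative to the general
    population. The history threshold is illustrative."""
    if len(patient_history) >= min_history:
        baseline = statistics.mean(patient_history)
        spread = statistics.stdev(patient_history) or 1e-6  # avoid divide-by-zero
    else:
        baseline, spread = population_mean, population_stdev
    return (current - baseline) / spread
```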

Descriptive model and analytics 1812 stores data representing the speech analysis and trend analysis described above, as well as metadata of constituent models of descriptive model and analytics 1812, as descriptive model metadata and shares that data with NLP model 1806, acoustic model 1808, and visual model 1810. The descriptive model metadata allows NLP model 1806, acoustic model 1808, and visual model 1810 to more accurately model the audiovisual signal.

Through runtime models 1802, runtime model server logic 504 estimates a health state of a patient using what the patient says, how the patient says it, and contemporaneous facial expressions, eye expressions, and poses in combination, and stores resulting data representing such estimation as results 1820. This combination provides a particularly accurate and effective tool for estimating the patient's health state.

Runtime model server logic 504 sends results 1820 to I/O logic 604 (FIG. 6) to enable interactive screening or monitoring server logic 502 to respond to the patient's responses, thereby making the screening or monitoring dialogue interactive in the manner described above. Runtime model server logic 504 (FIG. 18) also sends results 1820 to screening or monitoring system data store 410 to be included in the history of the subject.

Model training logic 506, shown in greater detail in FIG. 19, trains the models used by runtime model server logic 504 (FIG. 18).

Model training logic 506 (FIG. 19) includes runtime models 1802 and ASR logic 1804 and trains runtime models 1802. Model training logic 506 sends the trained models to model repository 416 to make runtime models 1802, as trained, available to runtime model server logic 504.

FIG. 20A provides a more detailed example illustration of the backend screening or monitoring system of the embodiment of FIG. 2. In this example block diagram, the web server 240 is expanded to illustrate that it includes a collection of functional modules. The primary component of the web server 240 is an input/output (IO) module 2041 for accessing the system via the network infrastructure 250. This IO module 2041 enables the collection of response data (in the form of at least speech and video data) and labels from the clients 260a-n, and the presentation of prompting information (such as a question or topic) and feedback to the clients 260a-n. The prompting material is driven by the interaction engine 2043, which is responsive to the needs of the system, user commands, and preferences to fashion an interaction that maintains the clients' 260a-n engagement and generates meaningful response data. The interaction engine will be discussed in greater detail below.

Truthfulness of the patient in answering questions (or other forms of interaction) posed by the screening or monitoring test is critical in assessing the patient's mental state, as is having a system that is approachable and that will be sought out and used by a prospective patient. The health screening or monitoring system 200 encourages honesty of the patient in a number of ways. First, a spoken conversation provides the patient with less time to compose a response to a question, or to discuss a topic, than a written response may take. This truncated time generally results in a more honest and "raw" answer. Second, the conversation feels, to the patient, more spontaneous and personal and is less annoying than an obviously generic questionnaire, especially when user preferences are factored into the interaction, as will be discussed below. Accordingly, the spoken interaction does not induce or exacerbate resentment in the patient for having to answer a questionnaire before seeing a doctor or other clinician. Third, the spoken interaction is adapted in progress to be responsive to the patient, reducing the patient's annoyance with the screening or monitoring test and, in some situations, shortening the screening or monitoring test. Fourth, the screening or monitoring test as administered by health screening or monitoring system 200 relies on more than mere verbal components of the interaction. Non-verbal aspects of the interaction are leveraged synergistically with the verbal content to assess depression in the patient. In effect, 'what is said' is not nearly as reliably accurate in assessing depression as is 'how it is said'.

The final component of the web server 240 is a results and presentation module 2045, which collates the results from the model server(s) 230 and provides them to the clients 260a-n via the IO module 2041, as well as providing feedback information to the interaction engine 2043 for dynamically adapting the course of the interaction to achieve the system's goals. Additionally, the results and presentation module 2045 supplies filtered results to stakeholders 270a-n via a stakeholder communication module 2003. The communication module 2003 encompasses a process engine, a routing engine, and a rules engine. The rules engine embodies conditional logic that determines what, when, and to whom to send communications; the process engine embodies clinical and operational protocol logic to pass messages through a communications chain that may be based on serial completion of tasks; and the routing engine provides the ability to send any message to the user's platform of choice (e.g., cellphone, computer, landline, tablet, etc.).

The filtering and/or alteration of the results by the results and presentation module 2045 is performed when necessary to maintain compliance with HIPAA (the Health Insurance Portability and Accountability Act of 1996) and other privacy and security regulations and policies, such as GDPR and SOC 2, as needed, and to present the relevant stakeholder 270a-n with information of the greatest use. For example, a clinician may desire to receive not only the screening or monitoring classification (e.g., depressed or neurotypical) but also additional descriptive features, such as suicidal thoughts, anxiety around another topic, etc. In contrast, an insurance provider may not need or desire many of these additional features, and may only be concerned with a diagnosis/screening or monitoring result. Likewise, a researcher may be provided only aggregated data that is not personally identifiable, in order to avoid transgression of privacy laws and regulations.
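
The role-dependent filtering described above can be sketched as a simple rules table. The role names and permitted fields below are illustrative assumptions:

```python
# Fields each stakeholder role is permitted to receive (illustrative only).
ROLE_FIELDS = {
    "clinician": {"classification", "descriptive_features", "confidence"},
    "insurer": {"classification"},
    "researcher": {"classification", "cohort_aggregates"},  # de-identified data only
}

def filter_results(results, role):
    """Return only the result fields the given stakeholder role may see."""
    allowed = ROLE_FIELDS.get(role, set())
    return {key: value for key, value in results.items() if key in allowed}
```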

The IO module 2041, in addition to connecting to the clients 260a-n, provides connectivity to the user data 220 and the model server(s) 230. The collected speech and video data (raw audio and video files in some embodiments) are provided by the IO module 2041 to the user data 220, the runtime model server(s) 2010, and a training data filter 2001. Label data from the clients 260a-n is provided to a label data set 2021 in the user data 220. This may be stored in various databases 2023. Label data includes not only verified diagnosed patients, but also inferred labels collected from particular user attributes or human annotation. Client ID information and logs may likewise be supplied from the IO module 2041 to the user data 220. The user data 220 may be further enriched with clinical and social records 210 sourced from any number of third-party feeds. This may include social media information obtained from web crawlers, EHR databases from healthcare providers, public health data sources, and the like.

The training data filter 2001 may consume speech and video data and append label data 2021 to it to generate a training dataset. This training dataset is provided to the model training server(s) 2030 for the generation of a set of machine-learned models. The models are stored in a model repository 2050 and are utilized by the runtime model server(s) 2010 to make a determination of the screening or monitoring results, in addition to generating other descriptors for the clients 260a-n. The model repository 2050, together with the model training server(s) 2030 and the runtime model server(s) 2010, makes up the model server(s) 230. The runtime model server(s) 2010 and model training server(s) 2030 are described in greater detail below in relation to FIGS. 20B and 21, respectively.

In FIG. 20B, the runtime model server(s) 2010 is provided in greater detail. The server receives speech and video inputs that originated from the clients 260a-n. A signal preprocessor and multiplexer 2011 performs conditioning on the inputted data, such as removal of noise or other artifacts in the signal that may cause modeling errors. These signal processing and data preparation tasks include diarization, segmentation, and noise reduction for both the speech and video signals. Additionally, metadata may be layered into the speech and video data. This data may be supplied in this preprocessed form to a bus 2014 for consumption by the modelers 2020 and may also be subjected to any number of third-party, off-the-shelf Automatic Speech Recognition (ASR) systems 2012. The ASR 2012 output includes a machine-readable transcription of the speech portion of the audio data. This ASR 2012 output is likewise supplied to the bus 2014 for consumption by later components. The signal preprocessor and multiplexer 2011 may be provided with confidence values, such as audio quality (signal quality, length of sample) and transcription confidence (how accurate the transcription is) values 2090 and 2091.

FIG. 20B also includes a metadata model 2018. The metadata model may analyze patient data, such as demographic data, medical history data, and patient-provided data.

Additionally, clinical data, demographic data, and social data may be presented to the bus 2014 for subsequent usage by the modelers 2020. Lastly, a model reader 2013 may access protected models from a model repository 2050, which are likewise provided to the bus 2014. The modelers 2020 consume the models, the preprocessed audio and visual data, and the ASR 2012 output to analyze the clients' 260a-n responses for the health state in question. Unlike prior systems for modeling a health condition, the present system includes a natural language processing (NLP) model 2015, an acoustic model 2016, and a video model 2017 that all operate in concert to generate classifications for the clients' 260a-n health state. These modelers not only operate in tandem, but also consume outputs from one another to refine the model outputs. Each of the modelers, and the manner in which they coordinate to enhance their classification accuracy, will be explored in greater detail in conjunction with subsequent figures.

The output of each of these modelers 2020 is provided, individually, to a calibration, confidence, and desired descriptors module 2092. This module calibrates the outputs in order to produce scaled scores, as well as provides confidence measures for the scores. The desired descriptors module may assign human-readable labels to scores. The output of the desired descriptors module 2092 is provided to a model weight and fusion engine 2019. This model weight and fusion engine 2019 combines the model outputs into a single consolidated classification for the health state of each client 260a-n. Model weighting may be done using static weights, such as weighting the output of the NLP model 2015 more than either the acoustic model 2016 or video model 2017 outputs. However, more robust and dynamic weighting methodologies may likewise be applied. For example, weights for a given model output may, in some embodiments, be modified based upon the confidence level of the classification by the model. For example, if the NLP model 2015 classifies an individual as being not depressed, with a confidence of 0.56 (on a scale of 0.00-1.00), but the acoustic model 2016 renders a depressed classification with a confidence of 0.97, in some cases the models' outputs may be weighted such that the acoustic model 2016 is provided a greater weight. In some embodiments, the weight of a given model may be linearly scaled by the confidence level, multiplied by a base weight for the model. In yet other embodiments, model output weights are temporally based. For example, generally the NLP model 2015 may be afforded a greater weight than other models; however, when the user isn't speaking, the video model 2017 may be afforded a greater weight for that time domain. Likewise, if the video model 2017 and acoustic model 2016 are independently suggesting the person is nervous and untruthful (frequent gaze shifting, increased perspiration, upward pitch modulation, increased speech rate, etc.), then the weight of the NLP model 2015 may be minimized, since it is likely the individual is not answering the question truthfully.
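
The confidence-scaled weighting described above, in which a model's base weight is linearly scaled by its classification confidence, can be sketched as follows. The base weights and example scores are illustrative:

```python
def fuse_classifications(outputs, base_weights):
    """Fuse per-model depression scores into one consolidated classification.

    `outputs` maps model name -> (score, confidence), where score is in [0, 1]
    (1 = depressed) and confidence is in [0, 1]. Each model's base weight is
    linearly scaled by its confidence, one of the weighting schemes described
    above; temporal weighting would vary the base weights over time instead.
    """
    weighted_sum, weight_total = 0.0, 0.0
    for name, (score, confidence) in outputs.items():
        weight = base_weights[name] * confidence
        weighted_sum += score * weight
        weight_total += weight
    return weighted_sum / weight_total if weight_total else 0.0

# Example from the text: a low-confidence "not depressed" from the NLP model
# is outweighed by a high-confidence "depressed" from the acoustic model.
fused = fuse_classifications(
    {"nlp": (0.0, 0.56), "acoustic": (1.0, 0.97), "video": (0.6, 0.70)},
    base_weights={"nlp": 1.5, "acoustic": 1.0, "video": 1.0},
)
```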

After model output fusion and weighting, the resulting classification may be combined with features and other user information in a multiplex output module 2051 in order to generate the final results. As discussed before, these results are provided back to the user data 220 for storage, and potentially as future training materials, and also to the results and presentation module 2045 of the web server 240 for display, at least in part, to the clients 260a-n and the stakeholders 270a-n. These results are likewise used by the interaction engine 2043 to adapt the interaction with the client 260a-n moving forward.

Turning now to FIG. 21, the model training server(s) 2030 is provided in greater detail. Like the runtime model server(s) 2010, the model training server(s) 2030 consume a collection of data sources. However, these data sources have been filtered by the training data filter 2001 to provide only data for which label information is known or imputable. The model training server additionally takes as inputs audio quality confidence values 2095 (which may include bit rate, noise, and length of the audio signal) and transcription confidence values 2096; these confidence values may include the same types of data as those of FIG. 20B. The filtered social, demographic, and clinical data, speech and video data, and label data are all provided to a preprocessor 2031 for cleaning and normalization of the filtered data sources. The processed data is then provided to a bus 2040 for consumption by various trainers 2039, and also to one or more third-party ASR systems 2032 for the generation of ASR outputs, which are likewise supplied to the bus 2040.

The model trainers 2039 consume the processed audio, visual, metadata, and ASR output data in an NLP trainer 2033, an acoustic trainer 2034, a video trainer 2035, and a metadata trainer 2036. The trained models are provided, individually, to a calibration, confidence, and desired descriptors module 2097. This module calibrates the outputs in order to produce scaled scores, as well as provides confidence measures for the scores. The desired descriptors module may assign human-readable labels to scores. The trained and calibrated models are provided to a fused model trainer 2037 for combining the trained models into a trained combinational model. Each individual model and the combined model may be stored in the model repository 2050. Additionally and optionally, the trained models may be provided to a personalizer 2038, which leverages metadata (such as demographic information and data collated from social media streams) to tailor the models specifically for a given client 260a-n.

For example, a particular model xo may be generated for classifying acoustic signals as either representing someone who is depressed or not. The tenor, pitch, and cadence of an audio input may vary significantly between a younger individual and an elderly individual. As such, specific models are developed based upon whether the patient being screened is younger or elderly (models xy and xe, respectively). Likewise, women generally have variances in their acoustic signals as compared to men, suggesting that yet another set of acoustic models is needed (models xf and xm, respectively). It is also apparent that combinational models are desired for a young woman versus an elderly woman, and a young man versus an elderly man (models xyf, xef, xym, and xem, respectively). Clearly, as further personalization groupings are generated, the possible number of applicable models will increase exponentially.

In some embodiments, if the metadata for an individual provides insight into that person's age, gender, ethnicity, educational background, accent/region they grew up in, etc., this information may be utilized to select the most appropriate model to use in future interactions with this given patient, and may likewise be used to train models that apply to individuals that share similar attributes.

In addition to personalizing models based upon population segments and attributes, the personalizer 2038 may personalize a model, or set of models, for a particular individual based upon their past history and the label data known for the individual. This activity is more computationally expensive than relying upon population-wide, or segment-wide, modeling, but produces more accurate and granular results. All personalized models are provided from the personalizer 2038 to the model repository 2050 for retention until needed for patient assessment.

During analysis, then, a client 260a-n is initially identified and, when able, a personalized model may be employed for their screening or monitoring. If one is not available, but metadata is known for the individual, the most specific model for the most specific segment is employed in their screening or monitoring. If no metadata is available, then the model selected is the generic, population-wide model. Utilizing such a tiered modeling structure, the more information that is known regarding the client 260a-n, the more specific and accurate the models that may be employed. Thus, for each client 260a-n, the 'best' model is leveraged given the data available for them.
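
The tiered selection can be sketched as a cascade from the most specific available model to the generic one. The key scheme follows the segment naming above (e.g., "xef" for an elderly female); the repository lookup is a hypothetical dict-like interface:

```python
def select_model(repository, patient_id, metadata):
    """Tiered model selection (sketch). `repository` is assumed to be a
    dict-like store mapping model keys to models, using the segment names
    above ("xy"/"xe" for age, "xf"/"xm" for gender, "xyf" etc. for
    combinations, and "xo" for the generic model)."""
    # Tier 1: a model personalized to this individual, if one exists.
    personalized = repository.get(f"patient:{patient_id}")
    if personalized is not None:
        return personalized

    # Tier 2: the most specific segment model the metadata supports.
    age = metadata.get("age_group")  # e.g., "y" (young) or "e" (elderly)
    gender = metadata.get("gender")  # e.g., "f" or "m"
    candidates = []
    if age and gender:
        candidates.append(f"x{age}{gender}")
    if age:
        candidates.append(f"x{age}")
    if gender:
        candidates.append(f"x{gender}")
    for key in candidates:
        model = repository.get(key)
        if model is not None:
            return model

    # Tier 3: the generic, population-wide model.
    return repository.get("xo")
```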

The general overall flow of information is shown in (FIG. 22). Assessment test administrator 2202 of real-time system 302 conducts an interactive conversation with the patient through patient device 312. The responsive audiovisual signal of the patient is received by real-time system 302 from patient device 312. The exchange of information between real-time system 302 and patient device 312 may be through a purpose-built app executing in patient device 312 or through a conventional video call between patient device 312 and video call logic of assessment test administrator 2202. While this illustrative embodiment uses an audiovisual signal to assess the state of the patient, it should be appreciated that, in alternative embodiments, an audio-only signal may be used with good results. In such alternative embodiments, an ordinary, audio-only telephone conversation may serve as the vehicle for assessment by assessment test administrator 2202.

In a manner described more completely below, assessment test administrator 2202 uses composite model 2204 to assess the state of the patient in real-time, i.e., as the spoken conversation transpires. Such intermediate assessment is used, in a manner described more completely below, to control the conversation, making the conversation more responsive, and therefore more engaging, to the patient and to help make the conversation as brief as possible while maintaining the accuracy of the final assessment.

Modeling system 304 receives collected patient data 2206 that includes the audiovisual signal of the patient during the assessment test. In embodiments in which the assessment test involves a purpose-built app executing in patient device 312, modeling system 304 may receive collected patient data 2206 from patient device 312. Alternatively, and in embodiments in which the assessment test involves a video or voice call with patient device 312, modeling system 304 receives collected patient data 2206 from real-time system 302.

Modeling system 304 retrieves clinical data 2220 from clinical data server 306. Clinical data 2220 includes generally any available clinical data related to the patient, other patients assessed by assessment test administrator 2202, and the general public that may be helpful in training any of the various models described herein.

Preprocessing 2208 conditions any audiovisual data for optimum analysis. Having a high-quality signal to start is very helpful in providing accurate analysis. Preprocessing 2208 is shown within modeling system 304. In alternative embodiments, preprocessing is included in real-time system 302 to improve accuracy in application of composite model 2204.

Speech recognition 2210 processes speech represented in the audiovisual data after preprocessing 2208, including automatic speech recognition (ASR). ASR may be conventional. Language model training 2212 uses the results of speech recognition 2210 to train language models 2214.

Acoustic model training 2216 uses the audiovisual data after preprocessing 2208 to train acoustic models 2218. Visual model training 2224 uses the audiovisual data after preprocessing 2208 to train visual models 2226. To the extent sufficient data (both collected patient data 2206 and clinical data 2220) is available for the subject patient, language model training 2212, acoustic model training 2216, and visual model training 2224 train language models 2214, acoustic models 2218, and visual models 2226, respectively, specifically for the subject patient. Training may also use clinical data 2220 for patients that share one or more phenotypes with the subject patient.

In a manner described more completely below, composite model builder 2222 uses language models 2214, acoustic models 2218, and visual models 2226, in combination with clinical data 2220, to combine language, acoustic, and visual models into composite model 2204. In turn, assessment test administrator 2202 uses composite model 2204 in real time to assess the current state of the subject patient and to accordingly make the spoken conversation responsive to the subject patient as described more completely below.

As mentioned above, assessment test administrator 2202 administers a depression assessment test to the subject patient by conducting an interactive spoken conversation with the subject patient through patient device 312.

Attention will now be focused upon the specific models used by the runtime model server(s) 2010. Moving on to FIG. 23A, a general block diagram for one example substantiation of the acoustic model 2016 is provided. The speech and video data is provided to a high level feature representor 2320 that operates in concert with a temporal dynamics modeler 2330. Influencing the operation of these components is a model conditioner 2340 that consumes features from the descriptive features 2018, results generated from the speech and video models 2015 and 2017, respectively, and clinical and social data.

Returning to the acoustic model 2016, the high level feature representor 2320 and temporal dynamics modeler 2330 also receive the outputs of the raw and higher level feature extractor 2310, which identify features within the incoming acoustic signals and feed them to the models. The high level feature representor 2320 and temporal dynamics modeler 2330 generate the acoustic model results, which may be fused into a final result that classifies the health state of the individual, and may also be consumed by the other models for conditioning purposes.

The high level feature representor 2320 leverages existing models for frequency, pitch, amplitude, and other acoustic features that provide valuable insights into feature classification. A number of off-the-shelf “black box” algorithms accept acoustic signal inputs and provide a classification of an emotional state with an accompanying degree of accuracy. For example, emotions such as sadness, happiness, anger, and surprise are already able to be identified in acoustic samples using existing solutions. Additional emotions such as envy, nervousness, excitedness, mirth, fear, disgust, trust, and anticipation may likewise be leveraged as such solutions are developed. However, the present systems and methods go further by matching these emotions, the strength of the emotion, and the confidence in the emotion, to patterns of emotional profiles that signify a particular mental health state. For example, pattern recognition may be trained, based upon patients that are known to be suffering from depression, to identify the emotional state of a respondent that is indicative of depression.

FIG. 23B shows an embodiment of FIG. 23A including an acoustic modeling block 2341. The acoustic modeling block 2341 includes a number of acoustic models. The acoustic models may be separate models that use machine learning algorithms. The illustrated listing of models shown in FIG. 23B is not necessarily an exhaustive listing of possible models. These models may include a combination of existing third party models and internally derived models. FIG. 23B includes acoustic embedding model 2342, spectral temporal model 2343, acoustic effect model 2345, speaker personality model 2346, intonation model 2347, temporal/speaking rate model 2348, pronunciation models 2349, and fluency models 2361. The machine learning algorithms used by these models may include neural networks, deep neural networks, support vector machines, decision trees, hidden Markov models, and Gaussian mixture models.

FIG. 23C shows a score calibration and confidence module 2370. The score calibration and confidence module 2370 includes a score calibration module 2371 and a performance estimation module 2374. The score calibration module 2371 includes a classification module 2372 and a mapping module 2373.

The score calibration and confidence module 2370 may accept as an input a raw score, produced by a machine learning algorithm, such as a neural network or deep learning network, that may be analyzing audiovisual data. The score calibration and confidence module 2370 may also accept a set of labels with which to classify data. The labels may be provided by clinicians. The classification module 2372 may apply one or more labels to the raw score, based on the value of the score. For example, if the score is a probability near 1, the classification module 2372 may apply a “severe” label to the score. The classification module 2372 may apply labels based on criteria set by clinicians, or may algorithmically determine labels for scores, e.g., using a machine learning algorithm. The mapping module 2373 may scale the raw score to fit within a range of numbers, such as 120-180 or 0-700. The classification module 2372 may operate before or after the mapping module 2373.
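
The following Python sketch illustrates one possible form of this calibration, assuming the raw score is a probability in [0, 1]; the thresholds, labels, and output range are illustrative placeholders rather than clinical values.

    def calibrate(raw_score: float, out_min: float = 0.0,
                  out_max: float = 700.0) -> dict:
        # Mapping module: linearly scale the raw [0, 1] score into the
        # configured reporting range (e.g., 0-700 or 120-180).
        scaled = out_min + raw_score * (out_max - out_min)
        # Classification module: apply a label by threshold. These
        # cutoffs are illustrative assumptions, not clinician criteria.
        if raw_score >= 0.85:
            label = "severe"
        elif raw_score >= 0.55:
            label = "moderate"
        elif raw_score >= 0.25:
            label = "mild"
        else:
            label = "minimal"
        return {"scaled_score": scaled, "label": label}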

After calibrating the data, the score calibration and confidence module 2370 may determine a confidence measure 2376 by estimating a performance for the labeled, scaled score. The performance may be estimated by analyzing features of the collected data, such as duration, sound quality, accent, and other features. The estimated performance may be a weighted parameter that is applied to the score. This weighted parameter may comprise the score confidence.

To provide greater context and clarification around the acoustic model's 2016 operation, a highly simplified and single substantiation of one possible version of the high level feature representor 2320 is provided in relation to FIG. 24. It should be noted that this example is provided for illustrative purposes only, and is not intended to limit the embodiments of the high level feature representor 2320 in any way.

In this example embodiment, the raw and high level feature extractor 2310 takes the acoustic data signal and converts it into a spectrogram image 2321. FIG. 55 provides an example image of such a spectrogram 5500 of a human speaking. A spectrogram of this sort provides information regarding the audio signal frequency (along one axis), the amplitude of the signal at each frequency (presented here in terms of intensity, i.e., how darkly the frequency is rendered), and time (along the other axis). Such a spectrogram 5500 is considered a raw feature of the acoustic signal, as would be pitch, cadence, energy level, etc.

A spectrogram sampler 2323 then selects a portion of the image at a constant timeframe; for example, between time zero and ten seconds is one standard sample size, but other sample time lengths are possible. FIG. 56 provides an example of a sampled portion 5502 of the spectrogram 5600. This image data is then represented as an M×N matrix (x), in this particular non-limiting example. An equation that includes x as a variable, and for which the solution is known, is then processed to determine estimates of the unknown variables (matrices and vectors) within the equation. For example, a linear equation such as ŷ = wᵀx + b may be utilized. As noted, the solution y is known.

This includes determining a set of randomized guesses for the unknown variables (wᵀ and b in this example equation). The equation is solved using these guessed values, and the error of the resulting solution is computed against the known solution value. The error may be computed as:

$\hat{E} = \frac{(y - \hat{y})^{2}}{N}$

By repeating this process iteratively, thousands if not millions of times, values for the variables that approximate the actual variable values may be determined. This is a brute force regression, in which the error value (Ê) is minimized.
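
A minimal Python sketch of this brute force search follows, assuming a flattened spectrogram sample x and a scalar known solution y; in practice, gradient-based optimization would typically replace pure random guessing.

    import numpy as np

    def brute_force_fit(x: np.ndarray, y: float, iters: int = 100_000):
        # x is the flattened M x N spectrogram sample; y is known.
        rng = np.random.default_rng(0)
        x = x.ravel()
        best = (None, None, np.inf)
        for _ in range(iters):
            w = rng.normal(size=x.size)        # randomized guess for w
            b = rng.normal()                   # randomized guess for b
            y_hat = w @ x + b                  # y_hat = w'x + b
            err = (y - y_hat) ** 2 / x.size    # E = (y - y_hat)^2 / N
            if err < best[2]:
                best = (w, b, err)             # keep the lowest-error guess
        return best                            # (w, b, minimized error)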

This approximate value is an abstraction of the mental state being tested, dependent upon the input equation. The system may have previously determined threshold, or cutoff, values 2322 for the variables which indicate whether the response is indicative of the mental state or not. These cutoff values are trained by analyzing responses from individuals for which the mental state is already known.

Equation determination may leverage deep learning techniques, as previously discussed. This may include recurrent neural networks 2324 and/or convolutional neural networks 2325. In some cases, long short-term memory (LSTM) or gated recurrent unit (GRU) networks may be employed, for example. In this manner, depression, or alternate mental states, may be directly analyzed for in the acoustic portion of the response. This, in combination with using off-the-shelf emotion detection ‘black box’ systems with pattern recognition, may provide a robust classification by a classifier 2326 of the mental state based upon the acoustic signal which, in this example, is provided as acoustic analysis output 2327.

As noted above, this example of using a spectrogram as a feature for analysis is but one of many possible substantiations of the high level feature representor's 2320 activity. Other features, and mechanisms for processing these features, may likewise be analyzed. For example, pitch levels, isolated breathing patterns, total energy of the acoustic signal, or the like may all be subject to similar temporally based analysis to classify the feature as indicative of a health condition.

Turning now to FIG. 25, the NLP model 2015 is provided in greater detail. This system consumes the output from the ASR system 2012 and performs post-processing on it via an ASR output post processor 2510. This post processing includes reconciling the ASR outputs (when multiple outputs are present). Post processing may likewise include n-gram generation, parsing activities, and the like.

Likewise, the results from the acoustic and video models 2016 and 2017, respectively, as well as clinical and social data, are consumed by a model conditioner 2540 for altering the functioning of the language models 2550. The language models 2550 operate in concert with a temporal dynamics modeler 2520 to generate the NLP model results.

The language models 2550 include a number of separate models. The illustrated listing of models shown in FIG. 25 is not necessarily an exhaustive listing of possible models. These models may include a combination of existing third party models and internally derived models. Language models may use standard machine learning or deep learning algorithms, as well as language modeling algorithms such as n-grams. For example, sentiment model 2551 is a readily available third party model that uses either original text samples or spoken samples that have been transcribed by a human or machine speech recognizer to determine whether the sentiment of the discussion is generally positive or negative. In general, a positive sentiment is inversely correlated with depression, whereas a negative sentiment is correlated with a depression classification.

Statistical language model 2552 utilizes n-grams and pattern recognition within the ASR output to statistically match patterns and n-gram frequency to known indicators of depression. For example, particular sequences of words may be statistically indicative of depression. Likewise, particular vocabulary and word types used by a speaker may indicate depression or not having depression.
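
A minimal sketch of such n-gram matching is shown below; the dictionary of indicator n-grams and weights is hypothetical and would, in the disclosed approach, be derived from labeled training data.

    from collections import Counter

    def ngram_indicator_score(tokens: list, indicators: dict,
                              n: int = 2) -> float:
        # Count every n-gram appearing in the ASR output.
        grams = Counter(tuple(tokens[i:i + n])
                        for i in range(len(tokens) - n + 1))
        total = sum(grams.values()) or 1
        # Sum the weighted relative frequency of known indicator n-grams.
        return sum(weight * grams[gram] / total
                   for gram, weight in indicators.items())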

A topic model 2553 identifies types of topics within the ASR output. Particular topics, such as death, suicide, hopelessness, and worth (or lack thereof), may all be positively correlated with a classification of depression. Additionally, there is a latent negative correlation between activity (signified by verb usage) and depression. Thus, ASR outputs that are high in verb usage may indicate that the client 260 a-n is not depressed. Furthermore, topic modeling based on the known question or prompt given to the subject can produce better performance by using pre-trained topic-specific models for processing the answer for mental health state.

Syntactic model 2554 identifies situations where the focus of the ASR output is internal versus external. The usage of terms like ‘I’ and ‘me’ is indicative of internal focus, while terms such as ‘you’ and ‘they’ are indicative of a less internalized focus. More internal focus has been identified as generally correlated with an increased chance of depression. Syntactic model 2554 may additionally look at speech complexity. Depressed individuals tend to have a reduction in sentence complexity. Additionally, energy levels, indicated by language that is strong or polarized, are negatively correlated with depression. Thus, someone with very simple sentences, focused internally, and with low-energy descriptive language, would indicate a depressed classification.
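
One highly simplified way to quantify internal versus external focus is a pronoun ratio, sketched below in Python; the word lists are illustrative assumptions, not a clinically validated lexicon.

    INTERNAL = {"i", "me", "my", "mine", "myself"}
    EXTERNAL = {"you", "your", "they", "them", "he", "she", "we"}

    def internal_focus_ratio(tokens: list) -> float:
        words = [t.lower() for t in tokens]
        internal = sum(w in INTERNAL for w in words)
        external = sum(w in EXTERNAL for w in words)
        total = internal + external
        # Values near 1.0 suggest strongly internal focus; 0.5 is neutral.
        return internal / total if total else 0.5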

Embedding and clustering model 2556 maps words to prototypical words or word categories. For example, the terms “kitten”, “feline” and “kitty” may all be mapped to the term “cat”. Unlike the other models, the embedding and clustering model 2556 does not generate a direct indication of whether the patient is depressed or not; rather, this model's output is consumed by the other language models 2550.

A dialogue and discourse model 2557 identifies latency and usage of spacer words (“like”, “umm”, etc.). Additionally, the dialogue and discourse model 2557 identifies dialogue acts, such as questions versus statements.

An emotion or affect model 2558 provides a score, typically a posterior probability over a set of predetermined emotions (for example, happy, sad), that describes how well the sample matches pre-trained models for each of the said emotions. These probabilities can then be used in various forms as input to the mental health state models, and/or in a transfer learning set up. A speaker personality model 2559 provides a score, typically a posterior probability over a set of predetermined speaker personality traits (for example, agreeableness, openness), that describes how well the sample matches pre-trained models for each of the said traits. These probabilities can then be used in various forms as input to the mental health state models, and/or in a transfer learning set up.

The non-verbal model 2561, using ASR events, may provide a score based on non-lexical speech utterances of patients, which may nonetheless be indicative of mental state. These utterances may be laughter, sighs, or deep breaths, which may be picked up and transcribed by an ASR.

The text quality confidence module 2560 determines a confidence measure for the output of the ASR output post processor 2510. The confidence measure may be determined based on text metadata (demographic information about the patient, environmental conditions, method of recording, etc.) as well as context (e.g., length of speech sample, question asked).

It should be noted that each of these models may impact one another and influence the results and/or how these results are classified. For example, a low energy language response typically is indicative of depression, whereas high energy verbiage would negatively correlate with depression.

Turning now to the video model 2017 of FIG. 26, again we see a collection of feature extractors 2610 that consume the video data. Within the feature extractors 2610 there is a face bounder 2611, which recognizes the edges of a person's face and extracts this region of the image for processing. Obviously, facial features provide significant input on how an individual is feeling. Sadness, exhaustion, worry, and the like are all associated with a depressive state, whereas jubilation, excitation, and mirth are all negatively correlated with depression.

Additionally, more specific bounders are contemplated; for example, the region around the eyes may be analyzed separately from regions around the mouth. This allows greater emphasis to be placed upon differing image regions based upon context. In this set of examples, the region around the mouth generally provides a large amount of information regarding an individual's mood; however, when a person is speaking, this data is more likely to be inaccurate due to movements associated with the speech formation. The acoustic and language models may provide insight as to when the user is speaking in order to reduce reliance on the analysis of a mouth region extraction. In contrast, the region around the eyes is generally very expressive when someone is speaking, so this feature is relied upon more heavily during times when the individual is speaking.

A pose tracker 2612 is capable of looking at larger body movements or positions. A slouched position indicates unease, sadness, and other features that indicate depression. The presence of excessive fidgeting, or conversely unusual stillness, is likewise indicative of depression. Moderate movement and fidgeting, however, are not associated with depression. Upright posture and relaxed movement likewise are inversely related to a depressive classification. Lastly, even the direction that the individual sits or stands is an indicator of depression. A user who directly faces the camera is less likely to be depressed. In contrast, an individual who positions their body oblique to the camera, or otherwise covers themselves (by crossing their arms, for example), is more likely to be depressed.

A gaze tracker 2613 is particularly useful in determining where the user is looking, and when (in response to what stimulus) the person's gaze shifts. Looking at the screen or camera of the client device 260 a-n indicates engagement, confidence and honesty—all hallmarks of a non-depressed state. Looking down constantly, on the other hand, is suggestive of depression. Constantly shifting gaze indicates nervousness and dishonesty. Such feedback may be used by the NLP model 2015 to reduce the value of analysis based on semantics during this time period, as the individual is more likely to be hedging their answers and/or outright lying. This is particularly true if the gaze pattern alters dramatically in response to a stimulus. For example, if the system asks if the individual has had thoughts of self-harm, and suddenly the user looks away from the camera and has a shifting gaze, a denial of such thought (which traditionally would be counted strongly as an indication of a non-depressed state) is discounted. Rather, emphasis is placed on outputs of the acoustic model 2016, and the video model 2017 from analysis of the other extracted features.

The image processing features extractor 2614 may take the form of any number of specific feature extractions, such as emotion identifiers, speaking identifiers (from the video as opposed to the auditory data), and the above disclosed specific bounder extractors (the region around the eyes, for example). All of the extracted features are provided to a high-level feature representor 2620 and classifier and/or regressor 2630 that operate in tandem to generate the video model results. As with the other models, the video model 2017 is influenced by the outputs of the NLP model 2015 and the acoustic model 2016, as well as clinical and social data. The model conditioner 2640 utilizes this information to modify what analysis is performed, or the weight afforded to any specific findings.

The descriptive features module 2018 of FIG. 27 includes direct measurements 2710 and model outputs 2720 that result from the analysis of the speech and video data. The descriptive features module may not be included in either the runtime model servers 2010 or model training servers 2030. Instead, descriptive features may be incorporated in the acoustic and NLP models. Disclosed in the description of FIG. 27 are examples of descriptive features. Many different measurements 2710 and model outputs 2720 are collected by the descriptive features 2018 module. For example, the measurements include at least a speech rate analyzer 2711, which tracks a speaker's words per minute. Faster speech generally indicates excitement, energy, and/or nervousness.

Slow speech rates, on the other hand, are indicative of hesitancy, lethargy, or the presence of a difficult topic. Alone, this measurement has little value, but when used as an input for other models, the speech rate provides context that allows for more accurate classification by these other models. Likewise, an energy analyzer 2713 measures the total acoustic energy in an audio component. Increased energy may indicate emphasis on particular portions of the interaction, general excitement or lethargy levels, and the like. Again, such information alone provides very little in determining if a person has depression, but when combined with the other models it is useful for ensuring that the appropriate classification is being made. For example, if the energy level increases when a person is speaking about their pet dog, the system determines that this topic is of interest to the individual, and if a longer interaction is needed to collect additional user data for analysis, the interaction may be guided to this topic. A temporal analyzer 2715 determines the time of the day, week, and year, in order to provide context around the interaction. For example, people are generally more depressed in the winter months, around particular holidays, and at certain days of the week and times of the day. All this timing information is usable to alter the interaction (by providing topicality) or to enable classification thresholds to be marginally altered to reflect these trends.

The model outputs 2720 may include a topic analyzer 2721, various emotion analyzers 2723 (anxiety, joy, sadness, etc.), sentiment analyzer 2725, engagement analyzer 2727, and arousal analyzer 2729. Some of these analyzers may function similarly to those in the other models; for example, the NLP model 2015 already includes a sentiment model 2551; however, the sentiment analyzer 2725 in the descriptive features 2018 module operates independently from the other models, and includes different input variables, even if the output is similar.

The engagement analyzer 2727 operates to determine how engaged a client 260 a-n is in the interaction. High levels of engagement tend to indicate honesty and eagerness. Arousal analyzer 2729 provides insights into how energetic or lethargic the user is. A key feature of the descriptive features 2018 module is that each of these features, whether measured or the result of model outputs, is normalized to the individual by a normalizer 2730. For example, some people simply speak faster than others, and a higher words-per-minute measurement for one individual versus another person may not indicate anything unusual. The degree of any of these features is adjusted for the baseline level of the particular individual by the normalizer 2730. Obviously, the normalizer 2730 operates more accurately the more data is collected for any given individual. A first-time interaction with a client 260 a-n cannot be effectively normalized immediately; however, as the interaction progresses, a baseline for this person's speech rate, energy levels, engagement, general sentiment/demeanor, etc. may be more readily ascertained using standard statistical analysis of the variation of these features over time. This becomes especially true after more than one interaction with any given individual.
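
A minimal sketch of such per-individual baseline normalization, implemented as a running z-score, is provided below; the interface is hypothetical.

    import statistics

    class BaselineNormalizer:
        def __init__(self):
            self.history = []

        def normalize(self, value: float) -> float:
            # Record the observation, then express it relative to this
            # individual's running baseline as a z-score.
            self.history.append(value)
            if len(self.history) < 2:
                return 0.0   # no baseline yet on a first observation
            mean = statistics.mean(self.history)
            stdev = statistics.stdev(self.history) or 1.0
            return (value - mean) / stdev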

After normalization, the system may identify trends in these features for the individual by analysis by a trend tracker 2740. The trend tracker splits the interaction by time domains and looks for changes in values between the various time periods. Statistically significant changes, and especially changes that continue over multiple time periods, are identified as trends for the feature for this individual. The features, both in raw and normalized form, and any trends are all output as the descriptive results.
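
The following sketch illustrates this windowed trend detection under simple statistical assumptions (a fixed window length, a pooled standard deviation, and a configurable significance threshold z); it is one possible reading of the trend tracker 2740, not the disclosed implementation.

    import statistics

    def detect_trends(series: list, window: int = 5, z: float = 2.0):
        # Split the (normalized) feature series into fixed time windows.
        chunks = [series[i:i + window]
                  for i in range(0, len(series) - window + 1, window)]
        means = [statistics.mean(c) for c in chunks]
        spread = statistics.pstdev(series) or 1.0
        changes = []
        for prev, curr in zip(means, means[1:]):
            if abs(curr - prev) > z * spread / window ** 0.5:
                changes.append("up" if curr > prev else "down")
            else:
                changes.append("flat")
        # A change that persists over consecutive periods counts as a trend.
        trends = [d for d, nxt in zip(changes, changes[1:])
                  if d == nxt and d != "flat"]
        return changes, trends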

Although not addressed in any of the Figures, it is entirely within the scope of embodiments of this disclosure that additional models are employed to provide classification regarding the client's 260 a-n health state using alternate data sources. For example, it has been discussed that the client devices may be capable of collecting biometric data (temperature, skin chemistry data, pulse rate, movement data, etc.) from the individual during the interaction. Models focused upon these inputs may be leveraged by the runtime model server(s) 2010 to arrive at determinations based upon this data. The disclosed systems may identify chemical markers in the skin (cortisol for example), perspiration, temperature shifts (e.g. flushing), and changes in heart rate, etc. for diagnostic purposes.

Now that the specifics of the runtime model server(s) 2010 have been discussed in considerable depth, attention will be turned to the interaction engine 2043, as seen in greater detail in relation to FIG. 28. A process flow diagram featuring the components of the interaction engine 2043 is provided in FIG. 28. The interaction engine 2043 dictates the interactions between the web server(s) 240 and the clients 260 a-n. These interactions, as noted previously, may consist of a question and answer session, with a set number and order of questions. In such embodiments, this type of assessment is virtually an automated version of what has previously been leveraged for depression diagnosis, except with audio and video capture for improved screening or monitoring accuracy. Such question and answer may be done with text questions displayed on the client device, or through a verbal recording of a question. However, such systems are generally not particularly engaging to a client 260 a-n, and may cause the interaction to not be completed honestly, or to be terminated early. As such, it is desirable to have a dynamic interaction, which necessitates a more advanced interaction engine 2043, such as the one seen in the present Figure.

This interaction engine 2043 includes the ability to take a number of actions, including different prompts, questions, and other interactions. These are stored in a question and action bank 2810. The interaction engine 2043 also includes a history and state machine 2820 which tracks what has already occurred in the interaction, and the current state of the interaction.

The state and history information, the database of possible questions and actions, and additional data are consumed by an interaction modeler 2830 for determining next steps in the interaction. The other information consumed consists of user data, clinical data, and social data for the client being interacted with, as well as model results, NLP outputs, and descriptive feature results. The user data, clinical data, and social media data are all consumed by a user preference analyzer 2832 for uncovering the preferences of a user. As noted before, appealing to the user is one of the large hurdles to successful screening or monitoring. If a user doesn't want to use the system, they will not engage with it in the first place, or may terminate the interaction prematurely. Alternatively, an unpleasant interaction may cause the user to be less honest and open with the system. Not being able to properly screen individuals for depression, or health states generally, is a serious problem, as these individuals are likely to continue struggling with their disease without assistance, or even worse, die prematurely. Thus, having a high degree of engagement with a user may literally save lives.

By determining preference information, the interactions are tailored in a manner that appeals to the user's interests and desires. Topics identified within social media feeds are incorporated into the interaction to pique the interest of the user. Collected preference data from the user modulates the interaction to be more user friendly, and particular needs or limitations of the user revealed in clinical data are likewise leveraged to make the interaction experience user-friendly. For example, if the clinical data includes information that the user experiences hearing loss, the volume of the interaction may be proportionally increased to make the interaction easier. Likewise, if the user indicates their preferred language is Spanish, the system may automatically administer the interaction in this language.

The descriptive features and model results, in contrast, are used by a user response analyzer 2831 to determine if the user has answered the question (when the interaction is in a question-answer format), or when sufficient data has been collected to generate an appropriate classification if the interaction is more of a ‘free-form’ conversation, or even a monologue by the client about a topic of interest.

Additionally, a navigation module 2834 receives NLP outputs and semantically analyzes the NLP results for command language in near real time. Such commands may include statements such as “Can you repeat that?”, “Please speak up”, “I don't want to talk about that”, etc. These types of ‘command’ phrases indicate to the system that an immediate action is being requested by the user.

Output from each of the navigation module 2834, user response analyzer 2831, and user preference analyzer 2832 is provided to an action generator 2833, which additionally has access to the question and adaptive action bank 2810 and the history and state machine 2820. The action generator 2833 applies a rule based model to determine which action within the question and adaptive action bank 2810 is appropriate. Alternatively, a machine learned model is applied in lieu of a rule based decision model. This results in the output of a customized action that is supplied to the IO 2041 for communication to the client 260 a-n. The customized action is likewise passed back to the history and state machine 2820 so that the current state and past actions may be properly logged. Customized actions may include, for example, asking a specific question, prompting a topic, switching to another voice or language, ending the interaction, altering the loudness of the interaction, altering speech rates, font sizes and colors, and the like.

Now that the structures and systems of the health screening or monitoring system 2000 have been described in considerable detail, attention will now be turned to one example process 2900 of health screening or monitoring of a client. In this example process the clinical and social data for the clients are collated and stored within the data store (at step 2910). This information may be gathered from social media platforms utilizing crawlers or similar vehicles. Clinical data may be collected from health networks, physicians, insurance companies or the like. In some embodiments, the health screening or monitoring system 2000 may be deployed as an extension of the care provider, which allows the sharing of such clinical data with reduced concerns with violation of privacy laws (such as HIPAA). However, when the health screening or monitoring system 2000 is operated as a separate entity, outside a healthcare network, additional consents, encryption protocols, and removal of personally identifiable information may be required to enable open sharing of the clinical data while staying in compliance with applicable regulations. Clinical data may include electronic health records, physician notes, medications, diagnoses and the like.

Next, the process may require that models are available to analyze a client's interaction. Initial datasets that include labeling data (confirmed or imputed diagnoses of depression) are fed to a series of trainers that train individual models, and subsequently fuse them into a combined model (at 2920). Such training may also include personalization of models when additional metadata is available.

FIG. 30 provides a more detailed illustration of an example process for such model training. As mentioned, label data is received (at 3010). Labels include a confirmed diagnosis of depression (or another health condition being screened for). Likewise, situations where the label may be imputed or otherwise estimated are used to augment the training data sets.

Imputed label data is received via a manual review of a medical record and/or interaction record with a given client. For example, in prediction mode, when the label is unknown, it is possible to decide whether a label can be estimated for a data point given other information such as patient records, system predictions, clinically-validated surveys and questionnaires, and other clinical data. Due to the relative rarity of labeled data sets, and the need for large numbers of training samples to generate accurate models, it is often important that the label data includes not just confirmed cases of depression, but also these estimated labels.

Additionally, the process includes receiving filtered data (at 3020). This data is filtered so that only data for which labels are known (or estimated) is used. Next, each of the models is trained. Such training includes training of the NLP model (at 3030), the acoustic model (at 3040), the video model (at 3050), and the descriptive features (at 3060). It should be noted that these training processes may occur in any order, or in parallel. In some embodiments the parallel training includes generating cross dependencies between the various models. These cross dependencies are one of the critical features that render the presently disclosed systems and methods uniquely capable of rendering improved and highly accurate classifications for a health condition.

The resulting trained models are fused, or aggregated, and the final fused trained model may be stored (at 3070). The models (both individual and fused models) are stored in a model repository. However, it is also desirable to generate model variants that are customized to different population groups or even specific individuals (at 3080).

The process for model customization and personalization is explored in further depth in relation to FIG. 31. Personalization relies upon metadata stored with the filtered training data. This metadata is received (at 3081). Particular population segment features are identified in the metadata and extracted out (at 3082). These segment features are used to train models that are specific to that segment. This is accomplished by clustering the filtered training data by these segmentation features (at 3083). A given training piece may be included in a number of possible segments, each non-overlapping, or of continually increasing granularity.

For example, assume labeled training data is received from a known individual. This individual is identified as a black woman in her seventies, in this example. This training data is then used to train models specific to African American individuals, African American women, women, elderly people, elderly women, elderly African American people, and elderly African American women. Thus, this single piece of training data is used to generate seven different models, each with a slightly different scope and level of granularity. In situations where age is further divided out, the number of models being trained off of this data is increased even further (e.g., adult women, women over 50, women over 70, individuals over 70, etc.). The models are then trained on this segment-by-segment basis (at 3084). The customized models are annotated by which segment(s) they are applicable to (at 3085), allowing for easy retrieval when a new response is received for classification where information about the individual is known and may be utilized to select the most appropriate/tailored model for this person.
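
The enumeration of segment models from a single sample's attributes may be illustrated as follows; the attribute names are hypothetical, and three attributes yield exactly the seven models described above.

    from itertools import combinations

    def segment_keys(attributes: dict) -> list:
        # Every non-empty subset of the sample's attributes names a
        # segment model that this sample helps train.
        items = sorted(attributes.items())
        keys = []
        for r in range(1, len(items) + 1):
            keys.extend(combinations(items, r))
        return keys

    # Three attributes yield 2^3 - 1 = 7 distinct segment models.
    assert len(segment_keys({"ethnicity": "african_american",
                             "gender": "female",
                             "age_band": "70s"})) == 7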

This is important because, often, the model for identifying a health condition in one individual may be wholly inadequate for classifying another individual. For example, a Caucasian person may require different video models compared to an individual of African descent. Likewise, men and women often have divergent acoustic characteristics that necessitate the leveraging of different acoustic models to accurately classify them. Even a woman in her early twenties sounds different than a woman in her fifties, which again differs from a woman in her eighties. NLP models for a native speaker versus a second language speaker may likewise be significantly different. Even between generations, NLP models differ significantly to address differences in slang and other speech nuances. By making models available for individuals at different levels of granularity, the most appropriate model may be applied, thereby greatly increasing classification accuracy by these models.

Returning to FIG. 30, after this personalization is completed, the customized models are also stored in the model repository, along with the original models and fused models (at 3090). It should be noted that while model customization generally increases classification accuracy, any such accuracy gains are jeopardized if a low number of training datasets is available for the models. The system tracks the number of training data sets that are used to train any given customized model, and only models with sufficiently large training sets are labeled as ‘active’ within the model repository. Active models are capable of being used by the runtime model server(s) 2010 for processing newly received response data. Inactive models are merely stored until sufficient data has been collected to properly train these models, at which time they are updated as being active.

Returning to FIG. 29, after model training, the process may engage in an interaction with a client (at 2930). This interaction may consist of a question and answer style format, a free-flowing conversation, or even a topic prompt and the client providing a monologue style input.

FIG. 32 provides an example of this interaction process. Initially, the system needs to be aware of the current state of the interaction (at 3210), as well as the historical actions that have been taken in the interaction. A state machine and a log of prior actions provide this context. The process also receives user, clinical, and social data (at 3220). This data is used to extract user preference information (at 3230). For example, preferences may be explicitly directed in the user data, such as language preferences, topics of interest, or the like. Alternatively, these preferences are distilled from the clinical and social data. For example, the social data provides a wealth of information regarding the topics of interest for the user, and clinical data provides insight into any accessibility issues, or the like.

Additionally, the model results are received (at 3240), which are used to analyze the user's responses (at 3250) and make decisions regarding the adequacy of the data that has already been collected. For example, if it is determined via the model results that there is not yet a clear classification, the interaction will be focused on collecting more data moving forward. Alternatively, if sufficient data has been collected to render a confident classification, the interaction may instead be focused on a resolution. Additionally, the interaction management will sometimes receive direct command statements/navigational commands (at 3260) from the user. These include actions such as repeating the last dialogue exchange, increasing or decreasing the volume, rephrasing a question, a request for more time, a request to skip a topic, and the like.

All this information is consumed by the action generator to determine the best course of subsequent action (at 3270). The action is selected from the question and adaptive action bank responsive to the current state (and prior history of the interaction) as well as any commands, preferences, and results already received. This may be completed using a rule based engine, in some embodiments. For example, direct navigational commands may take precedence over alternative actions, but barring a command statement by the user, the model responses may be checked against the current state to determine if the state objective has been met. If so, an action is selected from the repository that meets another objective that has not occurred in the history of the interaction. This action is also modified based on preferences, when possible. Alternatively, the action selection is based on a machine learned model (as opposed to a rule based system).

The customized action is used to manage the interaction with the client, and is also used to update the current state and historical state activity (at 3280). The process checks if the goals are met, and if the interaction should be concluded (at 3290). If not, then the entire process may be repeated for the new state and historical information, as well as any newly received response data, navigational commands, etc.

Returning to FIG. 29, during interaction (and after interaction completion when required based upon processing demands) the client response data is collected (at 2940). This data includes video/visual information as well as speech/audio information captured by the client device's camera(s) and microphone(s), respectively. Although not discussed in great depth, the collected data may likewise include biometric results via haptic interfaces or the like. The health state is then classified using this collected response data (at 2950).

FIG. 33 provides greater detail of the example process for classification. The models are initially retrieved (at 3310) from the model repository. The user data, social data, clinical data, and speech and visual data are all provided to the runtime model server(s) for processing (at 3320). The inclusion of the clinical and/or social data sets the present screening or monitoring methodologies apart from prior screening or monitoring methods.

This data is preprocessed to remove artifacts, noise, and the like. The preprocessed data is also multiplexed (at 3330). The preprocessed and multiplexed data is supplied to the models for analysis, as well as to third party ASR systems (at 3340). The ASR output may be consolidated (when multiple ASR systems are employed in concert), and the resulting machine readable speech data is also provided to the models. The data is then processed by the NLP model (at 3350 a), the acoustic model (at 3350 b), the video model (at 3350 c), and for descriptive features (at 3350 d). Each of the models operates in parallel, with results from any given model being fed to the others to condition their operations. A determination is made whether the modeling is complete (at 3360). Because the model results are interdependent upon the results of the other models, the process of modeling is cyclical, in some cases, whereby the models are conditioned (at 3370) with the results of the other models, and the modeling process repeats until a finalized result is determined.

FIG. 34 describes the process of model conditioning in greater detail. Model conditioning essentially includes three sub-processes operating in parallel, or otherwise interleaved. These include the configuration of the NLP model using the results of the acoustic model and video model, in addition to the descriptive features (at 3371), the configuration of the acoustic model using the results of the NLP model and video model, in addition to the descriptive features (at 3372), and the configuration of the video model using the results of the acoustic model and NLP model, in addition to the descriptive features (at 3373). As previously discussed, this conditioning is not a clearly ordered process, as intermediate results from the acoustic model, for example, may be used to condition the NLP model, the output of which may influence the video model, which then in turn conditions the acoustic model, requiring the NLP model to be conditioned based upon updated acoustic model results. This may lead to looped computing processes, wherein, with each iteration, the results are refined to be a little more accurate than the previous iteration. Artificial cutoffs are imposed in such computational loops to avoid infinite cycling and breakdown of the system due to resource drain. These cutoffs are based upon the number of loop cycles, or upon the degree of change in a value between one loop cycle and the next. Over time, the results from one loop cycle to the next become increasingly closer to one another. At some point additional looping cycles are not desired due to the diminishing returns to the model accuracy for the processing resources spent.
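
The looped conditioning and its artificial cutoffs may be sketched as follows; the model interface (an infer method accepting a condition_on argument and returning a dict with a 'score' entry) is a hypothetical stand-in for the disclosed models.

    def run_conditioned(models: dict, data, max_cycles: int = 10,
                        epsilon: float = 1e-3) -> dict:
        # Initial, unconditioned pass through every model.
        results = {name: m.infer(data) for name, m in models.items()}
        for _ in range(max_cycles):          # cutoff on loop-cycle count
            previous = dict(results)
            for name, model in models.items():
                others = {k: v for k, v in results.items() if k != name}
                results[name] = model.infer(data, condition_on=others)
            # Second cutoff: stop when scores change by less than epsilon
            # between cycles, reflecting diminishing returns.
            delta = max(abs(results[k]["score"] - previous[k]["score"])
                        for k in results)
            if delta < epsilon:
                break
        return results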

One example of this kind of conditioning is when the NLP model determines that the user is not speaking. This result is used by the video model to process the individual's facial features based upon mouth bounding and eye bounding. However, when the user is speaking, the video model uses this result to alter the model for emotional recognition to rely less upon the mouth regions of the user and rather rely upon the eye regions of the user's face. This is but a single simplified example of one type of model conditioning, and is not limiting.

Returning to FIG. 33, after modeling is completed, the models are then combined (fused) by weighting the classification results by time domains (at 3380). This sub-process is described in greater detail in relation to FIG. 35. As noted before, sometimes one model is relied upon more heavily than another model due to the classification confidence, or based upon events in the response. The clearest example of this is that if there is a period of time in which the user is not speaking, then the NLP model classification for this time period should be minimized, whereas the video modeling and acoustic modeling should be afforded much larger weights. Likewise, if two models suggest that the third model is incorrect or false, due to dishonesty or some other dissonance, then the odd model's classification may also be weighted lower than the other models' accordingly.

In FIG. 35, this weighting process involves starting with a base weight for each model (at 3381). The response is then divided up into discrete time segments (at 3382). The length of these time segments is configurable; in one embodiment, they are set to a three second value, as most spoken concepts are formed in this length of time. The base weights for each of the models are then modified based upon model confidence levels, for each time period (at 3383). For example, if the NLP model is 96% confident during the first six seconds, but only 80% confident in the following twelve seconds, a higher weight will be applied to the first two time periods, and a lower weight to the following four time periods.

The system also determines when the user is not speaking, generally by relying upon the ASR outputs (at 3384). During these periods the NLP model is not going to be useful in determining the user's classification, and as such the NLP model weights are reduced for these time periods (at 3385). The degree of reduction may differ based upon configuration, but in some embodiments, the NLP is afforded no weight for periods when the user is not speaking.

Likewise, periods where the patient exhibits voice-based biomarkers associated with being dishonest may also be identified, based upon features and conclusions from the video and acoustic models (at 3386). Excessive fidgeting, shifting gaze, higher pitch, and mumbling may all be correlated with dishonesty, and when multiple features are simultaneously present, the system flags these periods of the interaction as being suspect. During such time periods the NLP model weights are again reduced (at 3387), but only marginally. Even when a user is not being entirely honest, there is still beneficial information contained in the words they speak, especially for depression diagnosis. For example, even if a user is being dishonest about having suicidal thoughts (determined by semantic analysis), syntactical features may still be valid in determining the user's classification. As such, during periods of dishonesty, while the weight is tempered, the reduction is generally a quarter reduction in weight as opposed to a steeper weight reduction.

After all the weight adjustments have been made, the system performs a weighted average, over the entire response time period, of the models' classification results (at 3388). The result of this condensation of the classifications over time, and across the different component models, is the fused model output.
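
The time-weighted fusion of FIG. 35 may be sketched as follows, assuming per-segment scores and confidences for each model, per-segment speaking and suspect flags, and three-second segments; the 'nlp' key and the specific weight reductions mirror the example values given above.

    def fuse(scores: dict, confidences: dict, speaking: list,
             suspect: list, base_weights: dict) -> float:
        # scores[m][t] and confidences[m][t] are per-model, per-segment
        # values; speaking[t] and suspect[t] are per-segment flags.
        total, weight_sum = 0.0, 0.0
        for model in scores:
            for t in range(len(speaking)):
                w = base_weights[model] * confidences[model][t]
                if model == "nlp" and not speaking[t]:
                    w = 0.0    # no NLP weight while the user is silent
                elif model == "nlp" and suspect[t]:
                    w *= 0.75  # the marginal quarter reduction described above
                total += w * scores[model][t]
                weight_sum += w
        return total / weight_sum if weight_sum else 0.0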

Returning to FIG. 33, this fused model output generates a final classification (at 3390) for the interaction. This classification, the model results, and features are then output in aggregate or in part (at 3399). Returning to FIG. 29, these results are then presented to the client and other interested stakeholders (at 2960). This may include selecting which results any given entity should receive. For example, a client may be provided only the classification results, whereas a physician for the client will receive features relating to mood, topics of concern, indications of self-harm or suicidal thoughts, and the like. In contrast, an insurance company will receive the classification results, and potentially a sampling of the clinical data as it pertains to the individual's risk factors.

Even after reporting out classification results, the process continues by collecting new information as it becomes available, re-training models to ensure the highest levels of accuracy, and subsequent interactions and analysis of interaction results.

Turning now to FIG. 36, one example substantiation of an acoustic modeling process 3350 b is presented in greater detail. It should be noted that, despite the enhanced detail in this example process, this is still a significant simplification of but one of the analysis methodologies, and is intended purely as an illustrative process for the sake of clarity, and does not limit the analyses that are performed on the response data.

In this example process, a variable cutoff value is determined from the training datasets (at 3605). The acoustic signal that is received, in this particular analysis, is converted into a spectrogram image (at 3610), which provides information on the frequency of the audio signal and the amplitude at each of these frequencies. This image also tracks these over time. In this example process, a sample of the spectrogram image is taken that corresponds to a set length of time (at 3615). In some cases, this may be a ten second sample of the spectrogram data.

The image is converted into a matrix. This matrix is used in an equation to represent a higher order feature. The equation is developed from the training data utilizing machine learning techniques. The equation includes unknown variables, in addition to the input matrix of the high order feature (here, the spectrogram image sample). These unknown variables are multiplied by, divided by, added to, or subtracted from the feature matrix (or any combination thereof). The solution to the equation is also known, resulting in the need to randomly select values for the unknown variables (at 3620) in an attempt to solve the equation (at 3630) and get a solution that is similar to the known solution.

The solved equation value is compared to the known solution value in order to calculate the error (at 3630). This process is repeated thousands or even millions of times until a close approximation of the correct variable values is found, as determined by a sufficiently low error calculation (at 3635). Once these sufficiently accurate values are found, they are compared against the cutoff values that were originally determined from the training data (at 3640). If the values are above or below the cutoffs, this indicates the existence or absence of the classification, based on the equation utilized. In this manner the classification for the spectrogram analysis may be determined (at 3645), which may be subsequently output (at 3650) for incorporation with the other model results.

Modeling system logic 5320 includes speech recognition 2210 (FIG. 22), which is shown in greater detail in (FIG. 37). Speech recognition is specific to the particular language of the speech. Accordingly, speech recognition 2210 includes language-specific speech recognition 3702, which in turn includes a number of language-specific speech recognition engines 3706A-Z. The particular languages of language-specific speech recognition engines 3706A-Z shown in (FIG. 37) are merely illustrative examples.

Speech recognition 2210 also includes a translation engine 3704. Suppose, for example, that the patient speaks a language that is recognized by one of language-specific speech recognition engines 3706A-Z but is not processed by language models 2214 (FIG. 22). Language-specific speech recognition 3702 (FIG. 37) produces text in the language spoken by the patient, i.e., the patient's language, from the audio signal received from the patient. To enable application of language models 2214, which cannot process text in the patient's language in this illustrative example, translation engine 3704 translates the text from the patient's language to a language that may be processed by language models 2214, e.g., English. While language models 2214 may not be as accurate when relying on translation by translation engine 3704, accuracy of language models 2214 is quite good with currently available translation techniques. In addition, the importance of language models 2214 is diluted significantly by the incorporation of acoustic models 2218, visual models 2226, and clinical data 2220 in the creation of composite model 2204. As a result, composite model 2204 is extremely accurate notwithstanding reliance on translation engine 3704.

Modeling system logic 5320 includes language model training 2212 (FIG. 22) and language models 2214, which are shown in greater detail in FIGS. 38 and 39, respectively. Language model training 2212 (FIG. 38) includes logic for training respective models of language models 2214. For example, language model training 2212 (FIG. 38) includes syntactic language model training 3802, semantic pattern model training 3804, speech fluency model training 3806, and non-verbal model training 3808, which include logic for training syntactic language model 3902, semantic pattern model 3904, speech fluency model 3906, and non-verbal model 3908, respectively, of language models 2214.

Each of models 3902-3908 includes deep learning (also known as deep structured learning or hierarchical learning) logic that assesses the patient's depression from text received from speech recognition 2210.

Syntactic language model 3902 assesses a patient's depression from syntactic characteristics of the patient's speech. Examples of such syntactic characteristics include sentence length, sentence completion, sentence complexity, and negation. When a patient speaks in shorter sentences, fails to complete sentences, speaks in simple sentences, and/or uses relatively frequent negation (e.g., "no", "not", "couldn't", "won't", etc.), syntactic language model 3902 determines that the patient is more likely to be depressed.
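
By way of a non-limiting illustration, the following Python sketch computes two of the syntactic characteristics named above (mean sentence length and negation rate); the tokenization and the negation list are simplifying assumptions.

    import re

    NEGATIONS = {"no", "not", "couldn't", "won't", "never", "can't"}

    def syntactic_features(transcript: str) -> dict:
        """Crude sentence-level statistics of the kind model 3902 might consume."""
        sentences = [s.split() for s in re.split(r"[.!?]+", transcript) if s.strip()]
        words = [w.lower().strip(",;") for s in sentences for w in s]
        return {
            "mean_sentence_length": sum(len(s) for s in sentences) / max(len(sentences), 1),
            "negation_rate": sum(w in NEGATIONS for w in words) / max(len(words), 1),
        }

    print(syntactic_features("I couldn't sleep. No energy. It won't get better."))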

Semantic pattern model 3904 assesses a patient's depression from positive and/or negative content of the patient's speech, i.e., from sentiments expressed by the patient. Some research suggests that expression of negative thoughts may indicate depression and expression of positive thoughts may counter-indicate depression. For example, "the commute here was awful" may be interpreted as an indicator for depression while "the commute here was awesome" may be interpreted as a counter-indicator for depression.

Speech fluency model 3906 assesses a patient's depression from fluency characteristics of, i.e., the flow of, the patient's speech. Fluency characteristics may include, for example, word rates, the frequency and duration of pauses in the speech, the prevalence of filler expressions such as "uh" or "umm", and packet speech patterns. Some research suggests that lower word rates, frequent and/or long pauses in speech, and high occurrence rates of filler expressions may indicate depression. Perhaps more so than others of language models 2214, speech fluency model 3906 may be specific to the individual patient. For example, rates of speech (word rates) vary widely across geographic regions. The normal rate of speech for a patient from New York City may be significantly greater than the normal rate of speech for a patient from Minnesota.
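
The fluency characteristics above can be computed from time-aligned ASR output. The sketch below assumes a hypothetical (token, start, end) word format and a 0.3-second pause threshold; both are illustrative assumptions, not the claimed implementation.

    FILLERS = {"uh", "umm", "um", "er"}

    def fluency_features(words, duration_seconds):
        """words: list of (token, start_sec, end_sec) tuples (assumed ASR format)."""
        tokens = [w[0].lower() for w in words]
        word_rate = len(tokens) / (duration_seconds / 60.0)   # words per minute
        filler_rate = sum(t in FILLERS for t in tokens) / max(len(tokens), 1)
        # Pauses: gaps between the end of one word and the start of the next.
        pauses = [b[1] - a[2] for a, b in zip(words, words[1:]) if b[1] - a[2] > 0.3]
        return {"word_rate": word_rate,
                "filler_rate": filler_rate,
                "pause_count": len(pauses),
                "mean_pause_sec": sum(pauses) / max(len(pauses), 1)}

    demo = [("i", 0.0, 0.2), ("umm", 0.9, 1.1), ("feel", 2.0, 2.3), ("tired", 2.4, 2.8)]
    print(fluency_features(demo, duration_seconds=3.0))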

Non-verbal model 3908 assesses a patient's depression from non-verbal characteristics of the patient's speech, such as laughter, chuckles, and sighs. Some research suggests that sighs may indicate depression while laughter and chuckling (and other forms of partially repressed laughter such as giggling) may counter-indicate depression.

Modeling system logic 5320 includes acoustic model training 2216 (FIG. 22) and acoustic models 2218, which are shown in greater detail in FIGS. 40 and 41, respectively. Acoustic model training 2216 (FIG. 40) includes logic for training respective models of acoustic models 2218 (FIG. 41). For example, acoustic model training 2216 (FIG. 40) includes pitch/energy model training 4002, quality/phonation model training 4004, speaking flow model training 4006, and articulatory coordination model training 4008, which include logic for training pitch/energy model 4102, quality/phonation pattern model 4104, speaking flow model 4106, and articulatory coordination model 4108, respectively, of acoustic models 2218.

Each of models 4102-4108 includes deep learning (also known as deep structured learning or hierarchical learning) logic that assesses the patient's depression from audio signals representing the patient's speech as received from collected patient data 2206 (FIG. 22) and preprocessing 2208.

Pitch/energy model 4102 assesses a patient's depression from pitch and energy of the patient's speech. Examples of energy include loudness and syllable rate. When a patient speaks with a lower pitch, more softly, and/or more slowly, pitch/energy model 4102 determines that the patient is more likely to be depressed.

Quality/phonation model 4104 assesses a patient's depression from voice quality and phonation aspects of the patient's speech. Different voice source modifications may occur in depression and affect the voicing-related aspects of speech, both generally and for specific speech sounds.

Speaking flow model 4106 assesses a patient's depression from the flow of the patient's speech. Speaking flow characteristics may include, for example, word rates, the frequency and duration of pauses in the speech, the prevalence of filler expressions such as "uh" or "umm", and packet speech patterns.

Articulatory coordination model 4108 assesses a patient's depression from articulatory coordination in the patient's speech. Articulatory coordination refers to micro-coordination in timing among articulators and source characteristics. This coordination tends to degrade when the patient is depressed.

Modeling system logic 5320 (FIG. 53) includes visual model training 2224 (FIG. 22) and visual models 2226, which are shown in greater detail in FIGS. 42 and 43, respectively. Visual model training 2224 (FIG. 42) includes logic for training respective models of visual models 2226 (FIG. 43). For example, visual model training 2224 (FIG. 42) includes facial cue model training 4202 and eye/gaze model training 4204, which include logic for training facial cue model 4302 and eye/gaze model 4304, respectively, of visual models 2226.

Each of models 4302-4304 includes deep learning (also known as deep structured learning or hierarchical learning) logic that assesses the patient's depression from video signals representing the patient's speech as received from collected patient data 2206 (FIG. 22) and preprocessing 2208.

Facial cue model 4302 assesses a patient's depression from facial cues recognized in the video of the patient's speech. Eye/gaze model 4304 assesses a patient's depression from observed and recognized eye movements in the video of the patient's speech.

As described above, composite model builder 2222 (FIG. 22) builds composite model 2204 by combining language models 2214, acoustic models 2218, and visual models 2226 and training the combined model using both clinical data 2220 and collected patient data 2206. As a result, composite model 2204 assesses depression in a patient using what the patient says, how the patient says it, and contemporaneous facial and eye expressions in combination. This combination provides a particularly accurate and effective tool for assessing the patient's depression.
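
Composite model 2204 is trained as a combined model; purely to illustrate how per-modality outputs might be combined, the following sketch shows a simple weighted late fusion. The weights and the late-fusion form are illustrative assumptions, not the claimed training procedure.

    def composite_score(language, acoustic, visual, clinical,
                        weights=(0.3, 0.3, 0.2, 0.2)):
        """Weighted late fusion of per-modality depression scores in [0, 1].
        The weights are hypothetical and sum to 1.0."""
        return sum(w * s for w, s in zip(weights, (language, acoustic, visual, clinical)))

    print(composite_score(language=0.7, acoustic=0.6, visual=0.5, clinical=0.4))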

The above description is illustrative only and is not limiting. For example, while depression is the particular mental health condition addressed by the systems and methods described herein, it should be appreciated that the techniques described herein may effectively assess and/or screen for a number of other mental health conditions, such as anxiety, post-traumatic stress disorder (PTSD) and stress generally, drug and alcohol addiction, and bipolar disorder, among others. In addition, while assessment test administrator 2202 is described as assessing the mental health of the human subject, who may be a patient, it is appreciated that "assessment" sometimes refers to professional assessments made by professional clinicians. As used herein, the assessment provided by assessment test administrator 2202 may be any type of assessment in the general sense, including screening or monitoring.

Scoring

The models described herein may produce scores at various stages of an assessment. The scores produced may be scaled scores or binary scores. Scaled scores may range over a large number of values, while binary scores may be one of two discrete values. The disclosed system may interchange binary and scaled scores at various stages of the assessment to monitor different mental states, or may update particular binary scores and particular scaled scores for particular mental states over the course of an assessment.

The scores produced by the system, either binary or scaled, may be produced after each response to each query in the assessment, or may be formulated in part based on previous queries. In the latter case, each marginal score acts to fine-tune a prediction of depression, or of another mental state, as well as to make the prediction more robust. Marginal predictions may increase confidence measures for predictions of mental states in this way, after a particular number of queries and responses (correlated with a particular intermediate mental state).

For scaled scores, the refinement of the score may allow clinicians to determine, with greater precision, severities of one or more mental states the patient is experiencing. For example, the refinement of the scaled score, when observing multiple intermediate depression states, may allow a clinician to determine whether the patient has mild, moderate, or severe depression. Performing multiple scoring iterations may also assist clinicians and administrators in removing false negatives, by adding redundancy and robustness. For example, initial mental state predictions may be noisier, because relatively fewer speech segments are available to analyze, and NLP algorithms may not have enough information to determine semantic context for the patient's recorded speech. Even though a single marginal prediction may itself be a noisy estimate, refining the prediction by adding more measurements may reduce the overall variance in the system, yielding a more precise prediction. The predictions described herein may be more actionable than those which may be obtained by simply administering a survey, as people may have incentive to lie about their conditions. Administering a survey may yield high numbers of false positive and false negative results, allowing patients who need treatment to slip through the cracks. In addition, although trained clinicians may notice voice- and face-based biomarkers, they may not be able to analyze the large amount of data the system disclosed is able to analyze.

The scaled score may be used to describe a severity of a mental state. The scaled score may be, for example, a number between 1 and 5, or between 0 and 100, with larger numbers indicating a more severe or acute form of the patient's experienced mental state. The scaled score may include integers, percentages, or decimals. Conditions for which the scaled score may express severity may include, but are not limited to, depression, anxiety, stress, PTSD, phobic disorder, and panic disorder. In one example, a score of 0 on a depression-related aspect of an assessment may indicate no depression, a score of 50 may indicate moderate depression, and a score of 100 may indicate severe depression. The scaled score may be a composition of multiple scores. A mental state may be expressed as a composition of mental sub-states, and a patient's composite mental state may be a weighted average of individual scores from the mental sub-states. For example, a composition score of depression may be a weighted average of individual scores for anger, sadness, self-image, self-worth, stress, loneliness, isolation, and anxiety.
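
A composition score of the kind just described may be computed as follows; the sub-state weights are hypothetical values chosen only so that they sum to one.

    SUB_STATE_WEIGHTS = {"anger": 0.10, "sadness": 0.20, "self_image": 0.10,
                         "self_worth": 0.10, "stress": 0.15, "loneliness": 0.10,
                         "isolation": 0.10, "anxiety": 0.15}

    def composite_depression_score(sub_scores: dict) -> float:
        """Weighted average of sub-state scores, each on the same 0-100 scale."""
        return sum(SUB_STATE_WEIGHTS[state] * score for state, score in sub_scores.items())

    print(composite_depression_score({"anger": 40, "sadness": 70, "self_image": 55,
                                      "self_worth": 60, "stress": 65, "loneliness": 50,
                                      "isolation": 45, "anxiety": 60}))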

A scaled score may be produced using a model that uses a multilabel classifier. This classifier may be, for example, a decision tree classifier, a k-nearest neighbors classifier, or a neural network-based classifier. The classifier may produce multiple labels for a particular patient at an intermediate or final stage of assessment, with the labels indicating severities or extents of a particular mental state. For example, a multilabel classifier may output multiple numbers, which may be normalized into probabilities using a softmax layer. The label with the largest probability may indicate the severity of the mental state experienced by the patient.
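
The softmax normalization mentioned above works as follows; the severity labels and logit values in this sketch are hypothetical.

    import numpy as np

    def severity_from_logits(logits, labels=("none", "mild", "moderate", "severe")):
        """Normalize multilabel outputs into probabilities with a softmax layer."""
        exp = np.exp(logits - np.max(logits))   # subtract max for numerical stability
        probs = exp / exp.sum()
        return labels[int(np.argmax(probs))], probs

    print(severity_from_logits(np.array([0.2, 1.1, 2.3, 0.4])))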

The scaled score may also be determined using a regression model. The regression model may determine a fit from training examples that are expressed as sums of weighted variables. The fit may be used to extrapolate a score from a patient with known weights. The weights may be based in part on features, which may be in part derived from the audiovisual signal (e.g., voice-based biomarkers) and in part derived from patient information, such as patient demographics. Weights used to predict a final score or an intermediate score may be taken from previous intermediate scores.

The scaled score may be scaled based on a confidence measure. The confidence measure may be determined based on recording quality, the type of model used to analyze the patient's speech from a recording (e.g., audio, visual, semantic), temporal analysis related to which model was used most heavily during a particular period of time, and the point in time of a specific voice-based biomarker within an audiovisual sample. Multiple confidence measures may be taken to determine intermediate scores. Confidence measures during an assessment may be averaged in order to determine a weighting for a particular scaled score.

The binary score may reflect a binary outcome from the system. For example, the system may classify a user as being either depressed or not depressed. The system may use a classification algorithm to do this, such as a neural network or an ensemble method. The binary classifier may output a number between 0 and 1. If a patient's score is above a threshold (e.g., 0.5), the patient may be classified as "depressed." If the patient's score is below the threshold, the patient may be classified as "not depressed." The system may produce multiple binary scores for multiple intermediate states of the assessment. The system may weight and sum the binary scores from intermediate states of the assessment in order to produce an overall binary score for the assessment.
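
A minimal sketch of the thresholding and of weighting and summing intermediate binary scores follows; the normalization by the weight total, which keeps the combined value in [0, 1], is an illustrative choice rather than the claimed method.

    def overall_binary_score(intermediate_scores, weights, threshold=0.5):
        """Weight and sum per-stage classifier outputs, then threshold the result."""
        combined = sum(w * s for w, s in zip(weights, intermediate_scores))
        combined /= sum(weights)   # illustrative: keep the result in [0, 1]
        return ("depressed" if combined > threshold else "not depressed"), combined

    print(overall_binary_score([0.8, 0.6, 0.4], weights=[1.0, 2.0, 1.0]))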

The outputs of the models described herein can be converted to a calibrated score, e.g., a score with a unit range. The outputs of the models described herein can additionally or alternatively be converted to a score with a clinical value. A score with a clinical value can be a qualitative diagnosis (e.g., high risk of severe depression). A score with a clinical value can alternatively be a qualitative score that is normalized with respect to the general population or a specific sub-population of patients. The normalized, qualitative score may indicate a risk percentage relative to the general population or to the sub-population.

The systems described herein may be able to identify a mental state of a subject (e.g., a mental disorder or a behavioral disorder) with less error (e.g., 10% less) or a higher accuracy (e.g., 10% more) than a standardized mental health questionnaire or testing tool. The error rate or accuracy may be established relative to a benchmark standard usable by an entity for identifying or assessing one or more medical conditions comprising said mental state. The entity may be a clinician, a healthcare provider, an insurance company, or a government-regulated body. The benchmark standard may be a clinical diagnosis that has been independently verified.

Confidence

The models described herein may use confidence measures. A confidence measure may be a measure of how effective the score produced by the machine learning algorithm is at accurately predicting a mental state, such as depression. A confidence measure may depend on conditions under which the score was taken. A confidence measure may be expressed as a whole number, a decimal, or a percentage. Conditions may include the type of recording device, the ambient space in which signals were taken, background noise, patient speech idiosyncrasies, language fluency of a speaker, the length of the responses of the patient, an evaluated truthfulness of the responses of the patient, and the frequency of unintelligible words and phrases. Under conditions where the quality of the signal or speech makes it more difficult for the speech to be analyzed, the confidence measure may have a smaller value. In some embodiments, the confidence measure may be added to the score calculation, by weighting a calculated binary or scaled score with the confidence measure. In other embodiments, the confidence measure may be provided separately. For example, the system may tell a clinician that the patient has a 0.93 depression score with 75% confidence.
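
The two reporting options just described may be sketched as follows; the function and field names are hypothetical.

    def report(raw_score, confidence, fold_into_score=False):
        """Either weight the score by the confidence or report the two separately."""
        if fold_into_score:
            return {"score": raw_score * confidence}
        return {"score": raw_score, "confidence": confidence}

    print(report(0.93, 0.75))   # e.g., a 0.93 depression score with 75% confidence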

The confidence level may also be based on the quality of the labels of the training data used to train the models that analyze the patient's speech. For example, if the labels are based on surveys or questionnaires completed by patients rather than official clinical diagnoses, the quality of the labels may be determined to be lower, and the confidence level of the score may thus be lower. In some cases, it may be determined that the surveys or questionnaires have a certain level of untruthfulness. In such cases, the quality of the labels may be determined to be lower, and the confidence level of the score may thus be lower.

Various measures may be taken by the system in order to improve a confidence measure, especially where the confidence measure is affected by the environment in which the assessment takes place. For example, the system may employ one or more signal processing algorithms to filter out background noise, or use impulse response measurements to determine how to remove the effects of reverberations caused by objects and features of the environment in which the speech sample was recorded. The system may also use semantic analysis to find context clues to determine the identities of missing or unintelligible words.

In addition, the system may use user profiles to group people based on demeanor, ethnic background, gender, age, or other categories. Because people from similar groups may have similar voice-based biomarkers, the system may be able to predict depression with higher confidence, as people who exhibit similar voice-based biomarkers may indicate depression in similar manners.

For example, depressed people from different backgrounds may be variously characterized by slower speech, monotone pitch or low pitch variability, excessive pausing, vocal timbre (gravelly or hoarse voices), incoherent speech, rambling or loss of focus, terse responses, and stream-of-consciousness narratives. These voice-based biomarkers may belong to one or more segments of patients analyzed.

Screening system data store 410 (shown in greater detail in FIG. 44) stores and maintains all user and patient data needed for, and collected by, screening or monitoring in the manner described herein. Screening system data store 410 includes data store logic 4402, label estimation logic 4404, and user and patient databases 4406. Data store logic 4402 controls access to user and patient databases 4406. For example, data store logic 4402 stores audiovisual signals of patients' responses and provides patient clinical history data upon request. If the requested patient clinical history data is not available in user and patient databases 4406, data store logic 4402 retrieves the patient clinical history data from clinical data server 106. If the requested patient social history data is not available in user and patient databases 4406, data store logic 4402 retrieves the patient social history data from social data server 108. Users who are not patients include health care service providers and payers.

Social data server 108 may include a wide variety of patient/subject data including, but not limited to, retail purchasing records, legal records (including criminal records), and income history, as these may provide valuable insights into a person's health. In many instances, these social determinants of disease contribute more to a person's morbidity than medical care. Appendix B depicts a "Health Policy Brief: The Relative Contributions of Multiple Determinants to Health Outcomes".

Label estimation logic 4404 includes logic that specifies labels for which the various learning machines of health screening or monitoring server 102 screen. Label estimation logic 4404 includes a user interface through which human operators of health screening or monitoring server 102 may configure and tune such labels.

Label estimation logic 4404 also controls quality of model training by, inter alia, determining whether data stored in user and patient databases 4406 is of adequate quality for model training. Label estimation logic 4404 includes logic for automatically identifying or modifying labels. In particular, if model training reveals a significant data point that is not already identified as a label, label estimation logic 4404 looks for correlations between the data point and patient records, system predictions, and clinical insights to automatically assign a label to the data point.

While interactive screening or monitoring server logic 502 is described as conducting an interactive, spoken conversation with the patient to assess the health state of the patient, interactive screening or monitoring server logic 502 may also act in a passive listening mode. In this passive listening mode, interactive screening or monitoring server logic 502 passively listens to the patient speaking without directing questions to be asked of the patient.

Passive listening mode, in this illustrative embodiment, has two (2) variants. In the first, "conversational" variant, the patient is engaged in a conversation with another person whose part of the conversation is not controlled by interactive screening or monitoring server logic 502. Examples of conversational passive listening include a patient speaking with a clinician and a patient speaking during a telephone call reminding the patient of an appointment with a clinician or discussing medication with a pharmacist. In the second, "fly-on-the-wall" (FOTW) or "ambient" variant, the patient is speaking alone or in a public, or semi-public, place. Examples of ambient passive listening include people speaking in a public space or a hospital emergency room and a person speaking alone, e.g., in an audio diary or leaving a telephone message. One potentially useful scenario for screening or monitoring a person speaking alone involves interactive screening or monitoring server logic 502 screening or monitoring calls to police emergency services (i.e., "9-1-1"). Analysis of emergency service callers may distinguish truly urgent callers from less urgent callers.

It should be noted that this detailed description is intended to describe what is technologically possible. Practicing the techniques described herein should comply with legal requirements and limitations that may vary from jurisdiction to jurisdiction, including federal statutes, state laws, and/or local ordinances. For example, some jurisdictions may require explicit notice and/or consent of the involved person(s) prior to capturing their speech. In addition, acquisition, storage, and retrieval of clinical records should be practiced in a manner that is in compliance with applicable jurisdictional requirement(s).

Patient screening or monitoring system 100B (FIG. 45) illustrates a passive listening variation of patient screening or monitoring system 100 (FIG. 1). Patient screening or monitoring system 100B (FIG. 45) includes health screening or monitoring server 102, a clinical data server 106, and a social data server 108, which are as described above and, also as described above, connected to one another through WAN 110.

Since the patient and the clinician are in close physical proximity to one another in conversational passive listening, the remainder of the components of patient screening or monitoring system 100B are connected to one another and to WAN 110 through a local area network (LAN) 4510.

There are a number of ways to distinguish the patient's voice from the clinician's.

A particularly convenient one is to have two (2) separate listening devices 4512 and 4514 for the patient and clinician, respectively. In this illustrative embodiment, listening devices 4512 and 4514 are smart speakers, such as the HomePod™ smart speaker available from Apple Computer of Cupertino, Calif., the Google Home™ smart speaker available from Google LLC of Mountain View, Calif., and the Amazon Echo™ available from Amazon.com, Inc. of Seattle, Wash. In other embodiments, listening devices 4512 and 4514 may be other types of listening devices, such as microphones coupled to clinician device 114B, for example.

In some embodiments, a single listening device 4514 is used and screening or monitoring server 102 distinguishes between the patient and the clinician using conventional voice recognition techniques. Accuracy of such voice recognition may be improved by training screening or monitoring server 102 to recognize the clinician's voice prior to any session with a patient. While the following description refers to a clinician as speaking to the patient, it should be appreciated that the clinician may be replaced with another person. For example, in a telephone call made to the patient by a health care office administrator, e.g., support staff for a clinician, the administrator takes on the clinician's role as described in the context of conversational passive listening. Similarly, in a telephone call made by a pharmacy to a patient regarding prescriptions, the person or automated machine caller calling on behalf of the pharmacy takes on this clinician role as described herein. Appendix C depicts an exemplary Question Bank for some of the embodiments in accordance with the present invention.

Processing by interactive health screening or monitoring logic 402, particularly generalized dialogue flow logic 602 (FIG. 7), in conversational passive listening is illustrated by logic flow diagram 4600 (FIG. 46). FIG. 46 shows an instantiation of a dynamic mode, in which query content is analyzed in real time. Loop step 4602 and next step 4616 define a loop in which generalized dialogue flow logic 602 processes audiovisual signals of the conversation between the patient and the clinician according to steps 4604-4614. While steps 4604-4614 are shown as discrete, sequential steps, they are performed concurrently with one another on an ongoing basis by generalized dialogue flow logic 602. The loop of steps 4602-4616 is initiated and terminated by the clinician using conventional user interface techniques, e.g., using clinician device 114B (FIG. 45) or listening device 4514.

In step 4604 (FIG. 46), generalized dialogue flow logic 602 recognizes a question to the patient posed by the clinician and sends the question to runtime model server logic 504 for processing and analysis. Generalized dialogue flow logic 602 receives results 1820 for the audiovisual signal of the clinician's utterance, and results 1820 (FIG. 18) include a textual representation of the clinician's utterance from ASR logic 1804 along with additional information from descriptive model and analytics 1812. This additional information includes identification of the various parts of speech of the words in the clinician's utterance.

In step 4606 (FIG. 46), generalized dialogue flow logic 602 identifies the most similar question in question and dialogue action bank 710 (FIG. 7). If the question recognized in step 4604 is not identical to any question stored in question and dialogue action bank 710, generalized dialogue flow logic 602 may identify the nearest question in the manner described above with respect to question equivalence logic 1104 (FIG. 11) or may identify the question in question and dialogue action bank 710 (FIG. 7) that is most similar linguistically.
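
Question equivalence logic 1104 is described elsewhere; purely as a stand-in for "most similar linguistically," the sketch below uses token-overlap (Jaccard) similarity, which is an illustrative assumption rather than the claimed matching logic.

    def jaccard(a: str, b: str) -> float:
        """Token-overlap similarity between two questions."""
        sa, sb = set(a.lower().split()), set(b.lower().split())
        return len(sa & sb) / len(sa | sb)

    def nearest_question(recognized: str, bank: list) -> str:
        """Return the linguistically most similar question in the bank."""
        return max(bank, key=lambda q: jaccard(recognized, q))

    bank = ["Do you have thoughts of hurting yourself?",
            "How have you been sleeping lately?"]
    print(nearest_question("Have you had any thoughts of hurting yourself?", bank))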

In step 4608 (FIG. 46), generalized dialogue flow logic 602 retrieves the quality of the nearest question from question and dialogue action bank 710, i.e., quality 908 (FIG. 9).

In step 4610 (FIG. 46), generalized dialogue flow logic 602 recognizes an audiovisual signal representing the patient's response to the question recognized in step 4604.

The patient's response is recognized as an utterance of the patient immediately following the recognized question. The utterance may be recognized as the patient's by (i) determining that the voice is captured more loudly by listening device 4512 than by listening device 4514 or (ii) determining that the voice is distinct from a voice previously established and recognized as the clinician's.
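
Variant (i) above can be sketched by comparing signal energy at the two listening devices; the RMS comparison is one simple, assumed realization.

    import numpy as np

    def utterance_is_patients(patient_mic: np.ndarray, clinician_mic: np.ndarray) -> bool:
        """Attribute the utterance to whichever listening device heard it more loudly."""
        rms_patient = np.sqrt(np.mean(patient_mic.astype(float) ** 2))
        rms_clinician = np.sqrt(np.mean(clinician_mic.astype(float) ** 2))
        return rms_patient > rms_clinician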

In step 4612, generalized dialogue flow logic 602 sends the patient's response, along with the context of the clinician's corresponding question, to runtime model server logic 504 for analysis and evaluation. The context of the clinician's question is important, particularly if the semantics of the patient's response is unclear in isolation. For example, consider that the patient's answer is simply "Yes." That response is analyzed and evaluated very differently in response to the question "Were you able to find parking?" versus in response to the question "Do you have thoughts of hurting yourself?"

In step 4614, generalized dialogue flow logic 602 reports intermediate analysis received from results 1820 to the clinician. In instances in which the clinician is using clinician device 114B during the conversation, e.g., to review electronic health records of the patient, the report may be in the form of animated gauges indicating intermediate scores related to a number of health states. Examples of animated gauges include steam gauges, i.e., round dial gauges with a moving needle, and dynamic histograms such as those seen on audio equalizers in sound systems.

Upon termination of the conversational passive listening by the clinician, processing according to the loop of steps 4602-4616 completes. In step 4618, interactive screening or monitoring server logic 502 sends a final analysis of the conversation to the clinician. In the context of step 4618, the "clinician" may be a medical health professional or the health records of the patient.

Thus, health screening or monitoring server 102 may screen patients for any of a number of health states passively, during a conversation the patient would engage in regardless, without requiring a separate, explicit screening or monitoring interview of the patient.

In ambient passive listening, health screening or monitoring server 102 listens to and processes ambient speech according to logic flow diagram 4700 (FIG. 47). Processing by interactive health screening or monitoring logic 402, particularly generalized dialogue flow logic 602 (FIG. 7), in ambient passive listening is illustrated by logic flow diagram 4700 (FIG. 47). Loop step 4702 and next step 4714 define a loop in which generalized dialogue flow logic 602 processes audiovisual signals of ambient speech according to steps 4704-4712. While steps 4704-4712 are shown as discrete, sequential steps, they are performed concurrently with one another on an ongoing basis by generalized dialogue flow logic 602. The loop of steps 4702-4714 is initiated and terminated by a human operator of the listening device(s) involved, e.g., listening device 4514.

In step 4704 (FIG. 47), generalized dialogue flow logic 602 captures ambient speech. In test step 4706, interactive screening or monitoring server logic 502 determines whether the speech captured in step 4704 is spoken by a voice that is to be analyzed. In ambient passive listening in areas that are at least partially controlled, many people likely to speak in such areas may be registered with health screening or monitoring server 102 such that their voices may be recognized. In schools, students may have their voices registered with health screening or monitoring server 102 at admission.

In some embodiments, the people whose voices are to be analyzed are admitted students that are recognized by generalized dialogue flow logic 602. In hospitals, hospital personnel may have their voices registered with health screening or monitoring server 102 at hiring. In addition, patients in hospitals may register their voices at first contact, e.g., at an information desk or by hospital personnel in an emergency room. In some embodiments, hospital personnel are excluded from analysis when recognized as the speaker by generalized dialogue flow logic 602.

In an emergency room environment in which analysis of voices unknown to generalized dialogue flow logic 602 is important, generalized dialogue flow logic 602 may still track speaking by unknown speakers. Multiple utterances may be recognized by generalized dialogue flow logic 602 as emanating from the same individual person. Health screening or monitoring server 102 may also determine approximate positions of unknown speakers in environments with multiple listening devices, e.g., by triangulation using different relative amplitudes and/or relative timing of arrival of the captured speech at multiple listening devices.

In other embodiments of ambient passive listening in which only one person speaks, the speaker may be asked to identify herself. Alternatively, in some embodiments, the identity of the speaker may be inferred or is not especially important. In an audio diary, the speaker may be authenticated by the device or may be assumed to be the device's owner. In police emergency telephone call triage, the identity of the caller is not as important as the location of the speaker and qualities of the speaker's voice, such as emotion, energy, and the substantive content of the speaker's speech.

In these embodiments in which only one person speaks, generalized dialogue flow logic 602 always determines that the speaker is to be analyzed.

If the speaker is not to be analyzed, generalized dialogue flow logic 602 sends the captured ambient speech to runtime model server logic 504 for processing and analysis for context (step 4708). Generalized dialogue flow logic 602 receives results 1820 for the audiovisual signal of the captured speech, and results 1820 (FIG. 18) include a textual representation of the captured speech from ASR logic 1804 along with additional information from descriptive model and analytics 1812. This additional information includes identification of the various parts of speech of the words in the captured speech. Generalized dialogue flow logic 602 processes results 1820 for the captured speech to establish a context.

After step 4708 (FIG. 47), processing transfers through next step 4714 to loop step 4702, and passive listening according to the loop of steps 4702-4714 continues.

If, in test step 4706, interactive screening or monitoring server logic 502 determines that the speech captured in step 4704 is spoken by a voice that is to be analyzed, processing transfers to step 4710. In step 4710, generalized dialogue flow logic 602 sends the captured speech, along with any context determined in prior but contemporaneous performances of step 4708 or step 4710, to runtime model server logic 504 for analysis and evaluation.

In step 4712, generalized dialogue flow logic 602 processes any alerts triggered by the resulting analysis from runtime model server logic 504 according to predetermined alert rules. These predetermined alert rules are analogous to work-flows 4810 described below. In essence, these predetermined alert rules are in the form of if-then-else logic elements that specify logical states and corresponding actions to take in such states.

The following are examples of alert rules that may be implemented by interactive screening or monitoring server logic 502. In a police emergency system call in which the caller, speaking initially to an automated triage system, is determined to be highly emotional and anxious and to semantically describe a highly urgent situation, e.g., a car accident with severe injuries, a very high priority may be assigned to the call so that it is taken ahead of less urgent callers. In a school hallway in which interactive screening or monitoring server logic 502 recognizes frantic speech and screaming and semantic content describing the presence of a weapon and/or blatant acts of violence, interactive screening or monitoring server logic 502 may trigger immediate notification of law enforcement and school personnel. In an audio diary in which a patient is detected to be at least moderately depressed, interactive screening or monitoring server logic 502 may record the analysis in the patient's clinical records such that the patient's behavioral health care provider may discuss the diary entry when the patient is next seen. In situations in which the triggering condition of the captured speech is particularly serious and urgent, interactive screening or monitoring server logic 502 may report the location of the speaker if it may be determined.
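
In code, such if-then-else alert rules might be represented as condition/action pairs, as sketched below; the rule predicates, field names, and thresholds are hypothetical.

    ALERT_RULES = [
        # (condition over analysis results, action to take) - illustrative only
        (lambda r: r["emotion"] > 0.8 and r["urgency"] > 0.8,
         "assign very high call priority"),
        (lambda r: r["depression"] >= 0.5 and r["source"] == "audio_diary",
         "record analysis in the patient's clinical records"),
    ]

    def process_alerts(results: dict) -> list:
        """If-then evaluation of predetermined alert rules (step 4712)."""
        return [action for condition, action in ALERT_RULES if condition(results)]

    print(process_alerts({"emotion": 0.9, "urgency": 0.95,
                          "depression": 0.2, "source": "emergency_call"}))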

Processing according to the loop of steps 4702-4714 (FIG. 47) continues until stopped by a human operator of interactive screening or monitoring server logic 502 or of the involved listening devices.

Thus, health screening or monitoring server 102 may screen patients for any of a number of health states passively, outside the confines of a one-to-one conversation with a health care professional.

As described above with respect to FIG. 4, health care management logic 408 makes expert recommendations in response to health state analysis of interactive health screening or monitoring logic 402. Health care management logic 408 is shown in greater detail in FIG. 48.

Health care management logic 408 includes manual work-flow management logic 4802, automatic work-flow generation logic 4804, work-flow execution logic 4806, and work-flow configuration 4808. Manual work-flow management logic 4802 implements a user interface through which a human administrator may create, modify, and delete work-flows 4810 of work-flow configuration 4808 by physical manipulation of one or more user input devices of a computer system used by the administrator. Automatic work-flow generation logic 4804 performs statistical analysis of patient data stored within screening or monitoring system data store 410 to identify work-flows to achieve predetermined goals. Examples of such goals include minimizing predicted costs for the next two (2) years of a patient's care and minimizing the cost of an initial referral while also maximizing a reduction in Hemoglobin A1C in one year.

Work-flow execution logic 4806 processes work-flows 4810 of work-flow configuration 4808, evaluating conditions and performing actions of work-flow elements 4820.

In some embodiments, work-flow execution logic 4806 processes work-flows 4810 in response to receipt of final results of any screening or monitoring according to logic flow diagram 800 (FIG. 8), using those results in processing conditions of the work-flows.

Work-flow configuration 4808 (FIG. 48) includes data representing a number of work-flows 4810. Each work-flow 4810 includes work-flow metadata 4812 and data representing a number of work-flow elements 4820.

Work-flow metadata 4812 is metadata of work-flow 4810 and includes data representing a description 4814, an author 4816, and a schedule 4818. Description 4814 is information intended to inform any human operator of the nature of work-flow 4810. Author 4816 identifies the entity that created work-flow 4810, whether a human administrator or automatic work-flow generation logic 4804. Schedule 4818 specifies dates and times and/or conditions in which work-flow execution logic 4806 is to process work-flow 4810.

Work-flow elements 4820 collectively define the behavior of work-flow execution logic 4806 in processing the work-flow. In this illustrative embodiment, work-flow elements are each one of two types: conditions, such as condition 4900 (FIG. 49), and actions, such as action 5000 (FIG. 50).

In this illustrative embodiment, condition 4900 specifies a Boolean test that includes an operand 4902, an operator 4904, and another operand 4906. In this illustrative embodiment, operator 4904 may be any of a number of Boolean test operators, such as =, ≠, >, ≥, <, and ≤, for example. Operands 4902 and 4906 may each be results 1820 (FIG. 18) or any portion thereof, a constant, or null. As a result, any results of a given screening or monitoring, e.g., results 1820, any information about a given patient stored in screening or monitoring system data store 410, and any combination thereof may be either of operands 4902 and 4906.

Next work-flow element(s) 4908 specify one or more work-flow elements to process if the test of operands 4902 and 4906 and operator 4904 evaluates to a Boolean value of true, and next work-flow element(s) 4910 specify one or more work-flow elements to process if the test evaluates to a Boolean value of false.

Each of next work-flow element(s) 4908 and 4910 may be any of a condition, an action, or null. By accepting conditions such as condition 4900 in next work-flow element(s) 4908 and 4910, complex tests with AND and OR operations may be represented in work-flow elements 4820. In alternative embodiments, condition 4900 may include more operands and operators combined with AND, OR, and NOT operations.

Since each of operands 4902 and 4906 may be null, condition 4900 may test for the mere presence or absence of an occurrence in the patient's data. For example, to determine whether a patient has ever had a Hemoglobin A1C blood test, condition 4900 may determine whether the most recent Hemoglobin A1C test result is equal to null. If equal, the patient has not had any Hemoglobin A1C blood test at all.
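
Condition 4900 might be represented in code as follows; the dataclass layout, the use of Python's operator module, and the representation of null as None are illustrative assumptions.

    import operator
    from dataclasses import dataclass
    from typing import Any, Callable, Optional

    @dataclass
    class Condition:
        """Sketch of condition 4900: two operands, an operator, two branches."""
        operand_a: Any                    # a result value, a constant, or None (null)
        op: Callable[[Any, Any], bool]    # e.g., operator.eq, operator.gt
        operand_b: Any
        if_true: Optional[Any] = None     # next work-flow element(s) 4908
        if_false: Optional[Any] = None    # next work-flow element(s) 4910

        def next_elements(self):
            return self.if_true if self.op(self.operand_a, self.operand_b) else self.if_false

    # Testing for mere absence: most recent Hemoglobin A1C result equal to null.
    never_tested = Condition(operand_a=None, op=operator.is_, operand_b=None,
                             if_true="recommend Hemoglobin A1C blood test")
    print(never_tested.next_elements())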

Action 5000 (FIG. 50) includes action logic 5002 and one or more next work-flow element(s) 5004. Action logic 5002 represents the substantive action to be taken by work-flow execution logic 4806 and typically makes or recommends a particular course of action in the care of the patient that may range from specific treatment protocols to more holistic paradigms. Examples include referring the patient to a care provider, enrolling the patient in a particular program of care, and recording recommendations to the patient's file such that the patient's clinician sees the recommendation at the next visit. Examples of referring a patient to a care provider include referring the patient to a psychiatrist, a medication management coach, a physical therapist, a nutritionist, a fitness coach, a dietitian, a social worker, etc. Examples of enrolling the patient in a program include telepsychiatry programs, group therapy programs, etc.

Examples of recommendations recorded to the patient's file include recommended changes to medication, whether a change in the particular drug prescribed or merely in the dosage of the drug already prescribed to the patient, and other treatments. In addition, referrals and enrollment may be effected by recommendations for referrals and enrollment in the patient's file, allowing a clinician to make the final decision regarding the patient's care.

As described above, automatic work-flow generation logic 4804 (FIG. 48) performs statistical analysis of patient data stored within screening or monitoring system data store 410 to identify work-flows to achieve predetermined goals. Examples of such goals given above include minimizing predicted costs for the next two (2) years of a patient's care and minimizing the cost of an initial referral while also maximizing a reduction in Hemoglobin A1C in one year. Automatic work-flow generation logic 4804 is described in the illustrative context of the first, namely, minimizing predicted costs for the next two (2) years of a patient's care.

The manner in which automatic work-flow generation logic 4804 identifies work-flows to achieve predetermined goals is illustrated by logic flow diagram 5100 (FIG. 51).

Automatic work-flow generation logic 4804 includes deep learning machine logic. In step 5102, human computer engineers configure this deep learning machine logic of automatic work-flow generation logic 4804 to analyze patient data from screening or monitoring system data store 410 in the context of labels specified by users, e.g., labels related to costs of the care of each patient over a 2-year period in this illustrative example. Users of health screening or monitoring server 102 who are not merely patients are typically either health care providers or health care payers. In either case, information regarding events in a given patient's health care history is available and is included in automatic work-flow generation logic 4804 by the human engineers such that automatic work-flow generation logic 4804 may track costs of a patient's care from the patient's medical records.

Further in step 5102, the human engineers use all relevant data of screening or monitoring system data store 410 to train the deep learning machine logic of automatic work-flow generation logic 4804. After such training, the deep learning machine logic of automatic work-flow generation logic 4804 includes an extremely complex decision tree that predicts the costs of each patient over a 2-year period.

In step 5104, automatic work-flow generation logic 4804 determines which events in a patient's medical history have the most influence over the cost of the patient's care in a 2-year period for statistically significant portions of the patient population. In particular, automatic work-flow generation logic 4804 identifies deep learning machine (DLM) nodes of the decision tree that have the most influence over the predetermined goals, e.g., costs of the care of a patient over a 2-year period. There are several known techniques for making such a determination automatically, and automatic work-flow generation logic 4804 implements one or more of them to identify these significant nodes. Examples of techniques for identifying significantly influential events/decisions ("nodes" in machine learning parlance) in a deep learning machine include random decision forests (supervised or unsupervised), multinomial logistic regression, and naïve Bayes classifiers. These techniques are known and are not described herein.
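
As one concrete instance of the techniques named above, a random decision forest exposes per-feature importances that flag influential inputs. The sketch below uses scikit-learn on synthetic data; the feature layout and cost formula are fabricated for illustration only and do not represent real patient data.

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    rng = np.random.default_rng(0)
    X = rng.random((500, 6))           # per-patient event/phenotype features (synthetic)
    cost = 1000 * X[:, 2] + 200 * X[:, 4] + rng.normal(0, 50, 500)   # synthetic 2-year cost

    forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, cost)
    # Higher importances flag the events with the most influence over the goal.
    for i, importance in enumerate(forest.feature_importances_):
        print(f"feature {i}: {importance:.3f}")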

Loop step 5106 and next step 5112 define a loop in which automatic work-flow generation logic 4804 processes each of the influential nodes identified in step 5104. In a given iteration of the loop of steps 5106-5112, the particular node processed by automatic work-flow generation logic 4804 is sometimes referred to as the subject node.

In step 5108, automatic work-flow generation logic 4804 forms a condition, e.g., condition 4900 (FIG. 49), from the internal logic of the subject node. The internal logic of the subject node receives data representing one or more events in a patient's history and/or one or more phenotypes of the patient and makes a decision that represents one or more branches to other nodes. In step 5108 (FIG. 51), automatic work-flow generation logic 4804 generalizes the data received by the subject node and the internal logic of the subject node that maps the received data to a decision.

In step 5110, automatic work-flow generation logic 4804 forms an action, e.g., action 5000 (FIG. 50), according to the branch from the subject node that ultimately leads to the best outcome related to the predetermined goal, e.g., to the lowest cost over a 2-year period. The condition formed in step 5108 (FIG. 51) and the action formed in step 5110 collectively form a work-flow generated by automatic work-flow generation logic 4804.

Once all influential nodes have been processed according to the loop of steps 5106-5112, processing by automatic work-flow generation logic 4804 completes, having formed a number of work-flows.

In this illustrative embodiment, the automatically generated work-flows are subject to human ratification prior to actual deployment within health care management logic 408. In an alternative embodiment, health care management logic 408 automatically deploys work-flows generated automatically by automatic work-flow generation logic 4804 but limits actions to only recommendations to health care professionals. It is technically feasible to fully automate work-flow generation and changes to a patient's care without any human supervision. However, such automation may be counter to health care public policy in place today.

Clinical Scenarios

The disclosed system may also be used to evaluate mental health from primary care health interactions. For example, the system may be used to augment inferences about a patient's mental health made by a trained health care provider. The system may also be used to evaluate mental health from a preliminary screening or monitoring call (e.g., a call made to a health care provider organization by a prospective patient for the purpose of setting up a medical appointment with a trained mental health professional). For a primary screen, the health care professional may ask specific questions to the patient in a particular order to ascertain mental health treatment needs of the patient. A recording device may record prospective patient responses to one or more of these questions. The prospective patient's consent may be obtained before this occurs.

The system may perform an audio analysis or a semantic analysis on audio snippets it collects from the prospective patient. For example, the system may determine relative frequencies of words or phrases associated with depression. The system may predict that a user has depression if the user speaks with terms associated with negative thoughts, such as phrases indicating suicidal thoughts or self-harm instincts, phrases indicating a poor body image or self-image, and phrases indicating feelings of anxiety, isolation, or loneliness. The system may also pick up non-lexical or non-linguistic cues for depression, such as pauses, gasps, sighs, and slurred or mumbled speech. These terms and non-lexical cues may be similar to those picked up from training examples, such as patients administered a survey (e.g., the PHQ-9).
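
The relative-frequency analysis mentioned above may be sketched as follows; the term list is a hypothetical placeholder, not a clinically validated lexicon.

    DEPRESSION_TERMS = {"hopeless", "worthless", "alone", "tired", "hurt", "anxious"}

    def depression_term_rate(snippet_text: str) -> float:
        """Relative frequency of depression-associated terms in a transcript snippet."""
        words = [w.strip(".,!?").lower() for w in snippet_text.split()]
        return sum(w in DEPRESSION_TERMS for w in words) / max(len(words), 1)

    print(depression_term_rate("I feel hopeless and alone, just tired all the time."))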

The system may determine information about mental health by probing a user's physical health. For example, a user may feel insecure or sad about his or her physical features or physical fitness. Questions used to elicit information may have to do with vitals and physical health generally, such as blood pressure, resting heart rate, family history of disease, blood sugar, body mass index, body fat percentage, injuries, deformities, weight, height, eyesight, eating disorders, cardiovascular endurance, diet, or physical strength. Patients may provide speech which indicates despondence, exasperation, sadness, or defensiveness. For example, a patient may provide excuses as to why he or she has not gotten a medical procedure performed, why his or her diet is not going well, or why he or she has not started an exercise program, or may speak negatively about his or her height, weight, or physical features. Expression of such negativity about one's physical health may be correlated with anxiety.

The models may learn continually, either actively or passively. A passive learning model may not change the method by which it learns in response to new information. For example, a passive learner may continually use a specific condition to converge on a prediction, even as new types of feature information are added to the system. But such a model may be limited in effectiveness without a large amount of training data available. An active learning model, by contrast, may employ a human to converge more quickly. The active learner may ask targeted questions of the human in order to do this. For example, a machine learning algorithm may be employed on a large amount of unlabeled audio samples. The algorithm may be able to easily classify some as being indicative of depression, but others may be ambiguous. The algorithm may ask the patient if he or she was feeling depressed when uttering a specific speech segment. Or the algorithm may ask a clinician to classify the samples.

The system may be able to perform quality assurance of health providers using voice biomarkers. Data from the system may be provided to health care providers in order to assist the health care providers with detecting lexical and non-lexical cues that correspond to depression in patients. The health care providers may be able to use changes in pitch, vocal cadence, and vocal tics to determine how to proceed with care. The system may also allow health care providers to assess which questions elicit reactions from patients that are most predictive for depression. Health care providers may use data from the system to train one another to search for lexical and non-lexical cues, and to monitor care delivery to determine whether it is effective in screening or monitoring patients. For example, a health care provider may be able to observe a second health care provider question a subject to determine whether the second health care provider is asking questions that elicit useful information from the patient. The health care provider may be asking the questions in person or may be doing so remotely, such as from a call center. Health care providers may, using the semantic and audio information produced by the system, produce standardized methods of eliciting information from patients, based on which methods produce the most cues from patients.

The system may be used to provide a dashboard tabulating voice-based biomarkers observed in patients. For example, health care providers may be able to track the frequencies of specific biomarkers, in order to keep track of patients' conditions. They may be able to track these frequencies in real time to assess how their treatment methods are performing. They may also be able to track these frequencies over time, in order to monitor patients' performances under treatment or recovery progress. Mental health providers may be able to assess each other's performances using this collected data.

Dashboards may show real-time biomarker data as a snippet is being analyzed. They may show line graphs showing trends in measured biomarkers over time. The dashboards may show predictions taken at various time points, charting a patient's progress with respect to treatment. The dashboard may show patients' responses to treatment by different providers.

The system may be able to translate one or more of its models across different patient settings. This may be done to account for background audio information in different settings. For example, the system may employ one or more signal processing algorithms to normalize audio input across settings. This may be done by taking impulse response measurements of multiple locations and determining transfer functions of signals collected at those locations in order to normalize audio recordings. The system may also account for training in different locations. For example, a patient may feel more comfortable discussing sensitive issues at home or in a therapist's office than over the phone. Thus, voice-based biomarkers obtained in these settings may differ. The system may be trained in multiple locations, or training data may be labeled by location before it is processed by the system's machine learning algorithms.

The models may be transferred from location to location, for example, by using signal processing algorithms. They may also be transferred by modifying the questions asked of patients based on their locations. For example, it may be determined which particular questions, or sequences of questions, correspond to particular reactions within a particular location context. The questions may then be administered by the health care providers in such fashion as to elicit the same reactions from the patients.

The system may be able to use standard clinical encounters to train voice biomarker models. The system may collect recordings of clinical encounters for physical complaints. The complaints may be regarding injuries, sicknesses, or chronic conditions. The system may record, with patient permission, conversations patients have with health care providers during appointments. The physical complaints may indicate patients' feelings about their health conditions. In some cases, the physical complaints may be causing patients significant distress, affecting their overall dispositions and possibly causing depression.

The data may be encrypted as it is collected or while in transit to one or more servers within the system. The data may be encrypted using a symmetric-key encryption scheme, a public-key encryption scheme, or a blockchain encryption method. Calculations performed by the one or more machine learning algorithms may be encrypted using a homomorphic encryption scheme, such as a partially homomorphic encryption scheme or a fully homomorphic encryption scheme.

The data may be analyzed locally, to protect privacy. The system may analyze data in real time by implementing a trained machine learning algorithm to operate on speech sample data recorded at the location where the appointment is taking place.

Alternatively, the data may be stored locally. To preserve privacy, features may be extracted before being stored in the cloud for later analysis. The features may be anonymized to protect privacy. For example, patients may be given identifiers or pseudonyms to hide their true identities. The data may be protected with differential privacy to ensure that patient identities are not compromised. Differential privacy may be accomplished by adding noise to a data set. For example, a data set may include 100 records corresponding to 100 usernames, with added noise. If an observer has information about 99 records corresponding to 99 users and knows the remaining username, the observer will not be able to match the remaining record to the remaining username, because of the noise present in the system.
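
The noise addition described above corresponds to the standard Laplace mechanism from the differential privacy literature; the sensitivity and epsilon values below are illustrative.

    import numpy as np

    def privatize(values: np.ndarray, sensitivity: float, epsilon: float) -> np.ndarray:
        """Add Laplace noise with scale sensitivity/epsilon (the Laplace mechanism)."""
        rng = np.random.default_rng()
        return values + rng.laplace(loc=0.0, scale=sensitivity / epsilon, size=values.shape)

    scores = np.array([42.0, 17.0, 63.0])   # per-record values before release
    print(privatize(scores, sensitivity=1.0, epsilon=0.5))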

In some embodiments, a local model may be embedded on a user device. The local model may be able to perform limited machine learning or statistical analysis, subject to constraints of device computing power and storage. The model may also be able to perform digital signal processing on audio recordings from patients. The mobile device used may be a smartphone or tablet computer. The mobile device may be able to download algorithms over a network for analysis of local data. The local device may be used to ensure privacy, as data collected and analyzed may not travel over a network.

Voice-based biomarkers may be associated with lab values or physiological measurements. Voice-based biomarkers may be associated with mental health-related measurements. For example, they may be compared to the effects of psychiatric treatment, or to logs taken by healthcare professionals such as therapists. They may be compared to answers to survey questions, to see if the voice-based analysis matches assessments commonly made in the field.
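
As a non-limiting sketch, agreement between voice-based scores and a standard instrument may be quantified with a simple correlation; the score values below are illustrative assumptions.

```python
# A minimal sketch of validating voice-based scores against a standard
# survey instrument by correlation. The arrays are illustrative.
import numpy as np

voice_scores = np.array([0.2, 0.5, 0.7, 0.9, 0.4])   # model output
phq9_totals = np.array([3, 10, 15, 21, 8])           # clinician-scored

r = np.corrcoef(voice_scores, phq9_totals)[0, 1]
print(f"Pearson r between voice scores and PHQ-9: {r:.2f}")
```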

Voice-based biomarkers may be associated with physical health-related measurements. For example, health issues such as illness may alter a patient's vocal sounds, and these changes need to be accounted for in order to produce actionable predictions. In addition, depression predictions over a time scale in which a patient is recovering from an illness or injury may be compared to the patient's health outcomes over that time scale, to see if treatment is improving the patient's depression or depression-related symptoms. Voice-based biomarkers may be compared with data relating to brain activity collected at multiple time points, in order to determine the clinical efficacy of the system.

Training of the models may be continuous, so that the model is continuously running while audio data is collected. Voice-based biomarkers may be continually added to the system and used for training during multiple epochs. Models may be updated using the data as it is collected.
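
A minimal sketch of such incremental updating follows, assuming scikit-learn's SGDClassifier and a fixed-size acoustic feature vector (both assumptions, not requirements of this disclosure):

```python
# A minimal sketch of continuously updating a model as new labeled
# voice-biomarker feature vectors arrive. The classifier choice and
# feature dimensionality are assumptions.
import numpy as np
from sklearn.linear_model import SGDClassifier

N_FEATURES = 64  # assumed size of the acoustic feature vector
model = SGDClassifier(loss="log_loss")  # logistic regression via SGD
classes = np.array([0, 1])              # 0 = not depressed, 1 = depressed

def update_model(feature_batch: np.ndarray, labels: np.ndarray) -> None:
    """Incrementally train on a new batch without revisiting old data."""
    model.partial_fit(feature_batch, labels, classes=classes)

# Each collection session contributes a small batch.
X_batch = np.random.rand(8, N_FEATURES)
y_batch = np.random.randint(0, 2, size=8)
update_model(X_batch, y_batch)
```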

The system may use a reinforcement learning mechanism, where survey questions may be altered dynamically in order to elicit voice-based biomarkers that yield high-confidence depression predictions. For example, the reinforcement learning mechanism may be able to select questions from a group. Based on a previous question or a sequence of previous questions, the reinforcement mechanism may choose a question that may yield a high-confidence prediction of depression.
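
By way of a non-limiting illustration, question selection can be framed as a multi-armed bandit, one simple reinforcement learning mechanism; the epsilon-greedy policy, question identifiers, and reward definition below are illustrative assumptions.

```python
# A minimal sketch of question selection as a multi-armed bandit.
# Question IDs, rewards, and the epsilon-greedy policy are assumptions.
import random
from collections import defaultdict

class QuestionBandit:
    def __init__(self, question_ids, epsilon=0.1):
        self.question_ids = list(question_ids)
        self.epsilon = epsilon
        self.counts = defaultdict(int)
        self.values = defaultdict(float)  # mean reward per question

    def select(self) -> str:
        """Explore a random question occasionally; otherwise exploit."""
        if random.random() < self.epsilon:
            return random.choice(self.question_ids)
        return max(self.question_ids, key=lambda q: self.values[q])

    def update(self, question_id: str, reward: float) -> None:
        """Reward = confidence of the resulting depression prediction."""
        self.counts[question_id] += 1
        n = self.counts[question_id]
        self.values[question_id] += (reward - self.values[question_id]) / n

bandit = QuestionBandit(["sleep", "appetite", "mood", "energy"])
q = bandit.select()
bandit.update(q, reward=0.83)  # e.g., model confidence after the answer
```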

The system may be able to determine which questions or sequences of questions may be able to yield particular elicitations from patients. The system may use machine learning to predict a particular elicitation, by producing, for example, a probability. The system may also use a softmax layer to produce probabilities for multiple elicitations. The system may use as features the particular questions asked, the times at which they are asked, how long into a survey they are asked, the time of day at which they are asked, and the point within a treatment course at which they are asked.
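
A minimal sketch of a softmax layer over candidate elicitations follows; the elicitation labels, feature encoding, and weights are illustrative assumptions.

```python
# A minimal sketch of a softmax layer producing a probability per
# candidate elicitation. Labels, features, and weights are assumptions.
import numpy as np

ELICITATIONS = ["crying", "long_pause", "flat_affect", "none"]

def softmax(z: np.ndarray) -> np.ndarray:
    z = z - z.max()                # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def elicitation_probabilities(features: np.ndarray,
                              weights: np.ndarray) -> dict:
    """Map context features (question ID, time of day, point in survey,
    point in treatment course) to a probability per elicitation."""
    logits = weights @ features
    return dict(zip(ELICITATIONS, softmax(logits)))

features = np.array([1.0, 0.3, 0.7, 0.5])        # assumed encoding
weights = np.random.rand(len(ELICITATIONS), 4)   # learned in practice
print(elicitation_probabilities(features, weights))
```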

For example, a specific question asked at a specific time about a sensitive subject for a patient may elicit crying from the patient. This crying may be associated strongly with depression. The system, upon receiving context that it is that specific time, may recommend presenting the question to the patient.

The system may include a method of using a voice-based biomarker to dynamically affect a course of treatment. The system may log elicitations of users over a period of time and determine, from the logged elicitations, whether or not treatment has been effective. For example, if voice-based biomarkers become less indicative of depression over a long time period, this might be evidence that the prescribed treatment is working. On the other hand, if the voice-based biomarkers become more indicative of depression over a long time period, the system may prompt health care providers to pursue a change in treatment, or to pursue the current course of treatment more aggressively.
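
A minimal sketch of one way logged scores might be turned into a treatment signal follows, assuming a linear trend fit and illustrative thresholds:

```python
# A minimal sketch of flagging treatment response from a time series of
# depression scores derived from voice-based biomarkers. The thresholds
# and session cadence are illustrative assumptions.
import numpy as np

def treatment_signal(timestamps: np.ndarray,
                     scores: np.ndarray,
                     slope_threshold: float = 0.01) -> str:
    """Fit a linear trend to logged scores; a falling trend suggests the
    treatment is working, a rising trend suggests revisiting it."""
    slope, _ = np.polyfit(timestamps, scores, deg=1)
    if slope <= -slope_threshold:
        return "improving: continue current treatment"
    if slope >= slope_threshold:
        return "worsening: prompt provider to review treatment"
    return "stable: continue monitoring"

days = np.arange(0, 90, 7, dtype=float)  # weekly sessions over 90 days
scores = 0.8 - 0.004 * days + 0.02 * np.random.randn(len(days))
print(treatment_signal(days, scores))
```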

The system may spontaneously recommend a change in treatment. In an embodiment where the system is continually processing and analyzing data, the system may detect a sudden increase in voice-based biomarkers indicating depression. This may occur over a relatively short time window in a course of treatment. The system may also be able to spontaneously recommend a change if a course of treatment has been ineffective for a particular time period (e.g., six months, a year).

The system may be able to track a probability of a particular response to a medication. For example, the system may be able to track voice-based biomarkers taken before, during, and after a course of treatment, and analyze changes in scores indicative of depression.

The system may be able to track a particular patient's probability of response to medication by having been trained on similar patients. The system may use this data to predict a patient's response based on responses of patients from similar demographics. These demographics may include age, gender, weight, height, medical history, or a combination thereof.
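
As a non-limiting sketch, a k-nearest-neighbors model can predict a patient's likely response from demographically similar patients; the feature encoding, training data, and choice of k are illustrative assumptions.

```python
# A minimal sketch of predicting a patient's likely medication response
# from demographically similar patients via k-nearest neighbors.
# The encoding, data, and k are illustrative assumptions.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Columns: age, gender (0/1), weight_kg, height_cm (assumed encoding).
X_train = np.array([
    [34, 0, 70, 175],
    [29, 1, 60, 162],
    [58, 0, 82, 180],
    [45, 1, 68, 168],
])
y_train = np.array([1, 1, 0, 0])  # 1 = responded to medication

model = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
new_patient = np.array([[31, 1, 63, 165]])
print(model.predict_proba(new_patient))  # probability of response
```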

The system may also be able to track a patient's likely adherence to a course of medicine or treatment. For example, the system may be able to predict, based on analysis of time-series voice-based biomarkers, whether a treatment is having an effect on a patient. The health care provider may then ask the patient whether he or she is following the treatment.

In addition, the system may be able to tell whether the patient is following the treatment by analyzing his or her biomarkers during questioning. For example, a patient may become defensive, take long pauses, stammer, or otherwise behave in a manner suggesting that the patient is not being truthful about having adhered to the treatment plan. The patient may also express sadness, shame, or regret about not having followed the treatment plan.

The system may be able to predict whether a patient will adhere to a course of treatment or medication. The system may be able to use training data from voice-based biomarkers from many patients in order to make a prediction as to whether a patient will follow a course of treatment. The system may identify particular voice-based biomarkers as predicting adherence. For example, patients with voice-based biomarkers indicating dishonesty may be designated as less likely to adhere to a treatment plan.

The system may be able to establish a baseline profile for each individual patient.

An individual patient may have a particular style of speaking, with particular voice-based biomarkers indicating emotions, such as happiness, sadness, anger, and grief. For example, some people may laugh when frustrated or cry when happy. Some people may speak loudly or softly, speak clearly or mumble, have large or small vocabularies, and speak freely or more hesitantly. Some people may have extroverted personalities, while others may be more introverted.

Some people may be more hesitant to speak than others. Some people may be more guarded about expressing their feelings. Some people may have experienced trauma and abuse. Some people may be in denial about their feelings.

A person's baseline mood or mental state, and thus the person's voice-based biomarkers, may change over time. The model may be continually trained to account for this. As a patient's state improves, the model may predict depression less often. The model's predictions over time may be recorded by mental health professionals. These results may be used to show a patient's progress out of a depressive state.
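
A minimal sketch of one way a per-patient baseline might absorb such gradual drift, using an exponential moving average (the smoothing factor and feature size are assumptions):

```python
# A minimal sketch of a per-patient baseline that adapts as typical
# speaking style drifts over time, via an exponential moving average.
import numpy as np

class PatientBaseline:
    def __init__(self, n_features: int, alpha: float = 0.05):
        self.mean = np.zeros(n_features)
        self.alpha = alpha          # small alpha = slow-moving baseline
        self.initialized = False

    def update(self, features: np.ndarray) -> np.ndarray:
        """Return the deviation from baseline, then fold the new sample
        into the baseline so gradual drift is absorbed."""
        if not self.initialized:
            self.mean = features.copy()
            self.initialized = True
        deviation = features - self.mean
        self.mean = (1 - self.alpha) * self.mean + self.alpha * features
        return deviation

baseline = PatientBaseline(n_features=4)
session_features = np.array([0.6, 0.2, 0.9, 0.4])
print(baseline.update(session_features))  # deviations drive scoring
```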

The system may be able to make a particular number of profiles to account for different types of individuals. These profiles may be related to individuals' genders, ages, ethnicities, languages spoken, and occupations, for example.

Particular profiles may have similar voice-based biomarkers. For example, older people may have thinner, breathier voices than younger people. Their weaker voices may make it more difficult for microphones to pick up specific biomarkers, and they may speak more slowly than younger people. In addition, older people may attach stigma to behavioral therapy and thus not share as much information as younger people might.

Men and women may express themselves differently, which may lead to different biomarkers. For example, men may express negative emotions more aggressively or violently, while women may be better able to articulate their emotions.

In addition, people from different cultures may have different methods of dealing with or expressing emotions, or may feel guilt and shame when expressing negative emotions. It may be necessary to segment people based on their cultural backgrounds, in order to make the system more effective with respect to picking up idiosyncratic voice-based biomarkers.

The system may account for people with different personality types by segmenting and clustering by personality type. This may be done manually, as clinicians may be familiar with personality types and how people of those types may express feelings of depression. The clinicians may develop specific survey questions to elicit specific voice-based biomarkers from people in these segmented groups.

The voice-based biomarkers may be used to determine whether somebody is depressed, even if the person is holding back information or attempting to outsmart testing methods. This is because many of the voice-based biomarkers may be involuntary utterances. For example, the patient may equivocate or the patient's voice may quaver.

Particular voice-based biomarkers may correlate with particular causes of depression. For example, semantic analysis may be performed on many patients in order to find specific words, phrases, or sequences thereof that indicate depression. The system may also track the effects of treatment options on users, in order to determine their efficacy. Finally, the system may use reinforcement learning to determine better available methods of treatment.

Computer Figures

Real-time system 302 is shown in greater detail in FIG. 52. Real-time system 302 includes one or more microprocessors 5202 (collectively referred to as CPU 5202) that retrieve data and/or instructions from memory 5204 and execute retrieved instructions in a conventional manner. Memory 5204 may include generally any computer-readable medium including, for example, persistent memory such as magnetic and/or optical disks, ROM, and PROM, and volatile memory such as RAM.

CPU 5202 and memory 5204 are connected to one another through a conventional interconnect 5206, which is a bus in this illustrative embodiment and which connects CPU 5202 and memory 5204 to one or more input devices 5208, output devices 5210, and network access circuitry 5212. Input devices 5208 may include, for example, a keyboard, a keypad, a touch-sensitive screen, a mouse, a microphone, and one or more cameras. Output devices 5210 may include, for example, a display, such as a liquid crystal display (LCD), and one or more loudspeakers. Network access circuitry 5212 sends and receives data through computer networks such as network 308 (FIG. 3). Generally speaking, server computer systems often exclude input and output devices, relying instead on human user interaction through network access circuitry. Accordingly, in some embodiments, real-time system 302 does not include input devices 5208 and output devices 5210.

A number of components of real-time system 302 are stored in memory 5204. In particular, assessment test administrator 2202 and composite model 2204 are each all or part of one or more computer processes executing within CPU 5202 from memory 5204 in this illustrative embodiment but may also be implemented using digital logic circuitry. Assessment test administrator 2202 and composite model 2204 are both logic. As used herein, "logic" refers to (i) logic implemented as computer instructions and/or data within one or more computer processes and/or (ii) logic implemented in electronic circuitry.

Assessment test configuration 5220 is data stored persistently in memory 5204 and may be implemented as all or part of one or more databases.

Modeling system 304 (FIG. 3) is shown in greater detail in FIG. 53. Modeling system 304 includes one or more microprocessors 5302 (collectively referred to as CPU 5302), memory 5304, an interconnect 5306, input devices 5308, output devices 5310, and network access circuitry 5312 that are directly analogous to CPU 5202 (FIG. 52), memory 5204, interconnect 5206, input devices 5208, output devices 5210, and network access circuitry 5212, respectively. Being a server computer system, modeling system 304 may omit input devices 5308 and output devices 5310.

A number of components of modeling system 304 (FIG. 53) are stored in memory 5304.

In particular, modeling system logic 5320 is all or part of one or more computer processes executing within CPU 5302 from memory 5304 in this illustrative embodiment but may also be implemented using digital logic circuitry. Collected patient data 2206, clinical data 2220, and modeling system configuration 5322 are each data stored persistently in memory 5304 and may be implemented as all or part of one or more databases.

In this illustrative embodiment, real-time system 302, modeling system 304, and clinical data server 306 are shown, at least in the Figures, as separate, single server computers. It should be appreciated that logic and data of separate server computers described herein may be combined and implemented in a single server computer and that logic and data of a single server computer described herein may be distributed across multiple server computers. Moreover, it should be appreciated that the distinction between servers and clients is largely an arbitrary one to facilitate human understanding of the purpose of a given computer. As used herein, "server" and "client" are primarily labels to assist human categorization and understanding.

Health screening or monitoring server 102 is shown in greater detail in FIG. 54. As noted above, it should be appreciated that the behavior of health screening or monitoring server 102 described herein may be distributed across multiple computer systems using conventional distributed processing techniques. Health screening or monitoring server 102 includes one or more microprocessors 5402 (collectively referred to as CPU 5402) that retrieve data and/or instructions from memory 5404 and execute retrieved instructions in a conventional manner. Memory 5404 may include generally any computer-readable medium including, for example, persistent memory such as magnetic, solid state, and/or optical disks, ROM, and PROM, and volatile memory such as RAM.

CPU 5402 and memory 5404 are connected to one another through a conventional interconnect 5406, which is a bus in this illustrative embodiment and which connects CPU 5402 and memory 5404 to one or more input devices 5408, output devices 5410, and network access circuitry 5412. Input devices 5408 may include, for example, a keyboard, a keypad, a touch-sensitive screen, a mouse, a microphone, and one or more cameras. Output devices 5410 may include, for example, a display, such as a liquid crystal display (LCD), and one or more loudspeakers. Network access circuitry 5412 sends and receives data through computer networks such as WAN 110 (FIG. 1). Server computer systems often exclude input and output devices, relying instead exclusively on human user interaction through network access circuitry.

Accordingly, in some embodiments, health screening or monitoring server 102 does not include input devices 5408 and output devices 5410.

A number of components of health screening or monitoring server 102 are stored in memory 5404. In particular, interactive health screening or monitoring logic 402 and health care management logic 408 are each all or part of one or more computer processes executing within CPU 5402 from memory 5404. As used herein, "logic" refers to (i) logic implemented as computer instructions and/or data within one or more computer processes and/or (ii) logic implemented in electronic circuitry.

Screening system data store 410 and model repository 416 are each data stored persistently in memory 5404 and may be implemented as all or part of one or more databases. Screening system data store 410 also includes logic as described above.

The above description is illustrative only and is not limiting. For example, while much of the description above pertains to depression and anxiety, it should be appreciated that the techniques described herein may effectively estimate and/or screen for a number of other health conditions such as post-traumatic stress disorder (PTSD) and stress generally, drug and alcohol addiction, and bipolar disorder, among others. Moreover, while the majority of the health states for which health screening or monitoring server 102 screens as described herein are mental health states or behavioral health ailments, health screening or monitoring server 102 may screen for health states unrelated to mental or behavioral health. Examples include Parkinson's disease, Alzheimer's disease, chronic obstructive pulmonary disease, liver failure, Crohn's disease, myasthenia gravis, amyotrophic lateral sclerosis (ALS), and decompensated heart failure.

Moreover, many modifications of and/or additions to the above described embodiment(s) are possible. For example, with patient consent, corroborative patient data for mental illness diagnostics may be extracted from one or more of the patient's biometrics, including heart rate, blood pressure, respiration, perspiration, and body temperature. It may also be possible to use audio without words, for privacy or for cross-language analysis. It is also possible to use acoustic modeling without visual cues.

The present invention is defined solely by the claims which follow and their full range of equivalents. It is intended that the following appended claims be interpreted as including all such alterations, modifications, permutations, and substitute equivalents as fall within the true spirit and scope of the present invention.

Now that the systems and methods for screening or monitoring for a health condition, namely depression in a number of the embodiments, have been described, attention shall now be focused upon examples of systems capable of executing the above functions. To facilitate this discussion, FIGS. 57 and 58 illustrate a Computer System 5700, which is suitable for implementing embodiments of the present invention. FIG. 57 shows one possible physical form of the Computer System 5700. Of course, the Computer System 5700 may have many physical forms, ranging from a printed circuit board, an integrated circuit, or a small handheld device up to a huge supercomputer or a collection of networked computers (or computing components operating in a distributed network). Computer System 5700 may include a Monitor 5702, a Display 5704, a Housing 5706, a Disk Drive 5708, a Keyboard 5710, and a Mouse 5712. Storage medium 5714 is a computer-readable medium used to transfer data to and from Computer System 5700.

FIG. 58 is an example of a block diagram 5800 for Computer System 5700. Attached to System Bus 5720 are a wide variety of subsystems. Processor(s) 5722 (also referred to as central processing units, or CPUs) are coupled to storage devices, including Memory 5724. Memory 5724 includes random access memory (RAM) and read-only memory (ROM). As is well known in the art, ROM acts to transfer data and instructions uni-directionally to the CPU, and RAM is used typically to transfer data and instructions in a bi-directional manner. Both of these types of memories may include any suitable computer-readable media described below. A Fixed medium 5726 may also be coupled bi-directionally to the Processor 5722; it provides additional data storage capacity and may also include any of the computer-readable media described below. Fixed medium 5726 may be used to store programs, data, and the like and is typically a secondary storage medium (such as a hard disk) that is slower than primary storage. It will be appreciated that the information retained within Fixed medium 5726 may, in appropriate cases, be incorporated in standard fashion as virtual memory in Memory 5724. Removable medium 5714 may take the form of any of the computer-readable media described below.

Processor 5722 is also coupled to a variety of input/output devices, such as Display 5704, Keyboard 5710, Mouse 5712, and Speakers 5730. In general, an input/output device may be any of: video displays, trackballs, mice, keyboards, microphones, touch-sensitive displays, transducer card readers, magnetic or paper tape readers, tablets, styluses, voice or handwriting recognizers, biometrics readers, motion sensors, motion trackers, brain wave readers, or other computers. Processor 5722 optionally may be coupled to another computer or telecommunications network using Network Interface 5740. With such a Network Interface 5740, it is contemplated that the Processor 5722 might receive information from the network or might output information to the network in the course of performing the above-described health screening or monitoring. Furthermore, method embodiments of the present invention may execute solely upon Processor 5722 or may execute over a network such as the Internet in conjunction with a remote CPU that shares a portion of the processing.

Software is typically stored in the non-volatile memory and/or the drive unit. Indeed, for large programs, it may not even be possible to store the entire program in the memory. Nevertheless, it should be understood that for software to run, if necessary, it is moved to a computer-readable location appropriate for processing, and for illustrative purposes, that location is referred to as the memory in this disclosure. Even when software is moved to the memory for execution, the processor will typically make use of hardware registers to store values associated with the software, and a local cache that, ideally, serves to speed up execution. As used herein, a software program is assumed to be stored at any known or convenient location (from non-volatile storage to hardware registers) when the software program is referred to as "implemented in a computer-readable medium." A processor is considered to be "configured to execute a program" when at least one value associated with the program is stored in a register readable by the processor.

In operation, the computer system 5700 may be controlled by operating system software that includes a file management system, such as a disk operating system. One example of operating system software with associated file management system software is the family of operating systems known as Windows® from Microsoft Corporation of Redmond, Wash., and their associated file management systems. Another example of operating system software with its associated file management system software is the Linux operating system and its associated file management system. The file management system is typically stored in the non-volatile memory and/or drive unit and causes the processor to execute the various acts required by the operating system to input and output data and to store data in the memory, including storing files on the non-volatile memory and/or drive unit.

Some portions of the detailed description may be presented in terms of algorithms and symbolic representations of operations on data bits within a computer memory. These algorithmic descriptions and representations are the approaches used by those skilled in the data processing arts to most effectively convey the substance of their work to others skilled in the art. An algorithm is here, and generally, conceived to be a self-consistent sequence of operations leading to a desired result. The operations are those requiring physical manipulations of physical quantities. Usually, though not necessarily, these quantities take the form of electrical or magnetic signals capable of being stored, transferred, combined, compared, and otherwise manipulated. It has proven convenient at times, principally for reasons of common usage, to refer to these signals as bits, values, elements, symbols, characters, terms, numbers, or the like.

The algorithms and displays presented herein are not inherently related to any particular computer or other apparatus. Various general-purpose systems may be used with programs in accordance with the teachings herein, or it may prove convenient to construct more specialized apparatus to perform the methods of some embodiments. In addition, the techniques are not described with reference to any particular programming language, and various embodiments may, thus, be implemented using a variety of programming languages.

In alternative embodiments, the machine operates as a standalone device or may be connected (e.g., networked) to other machines. In a networked deployment, the machine may operate in the capacity of a server or a client machine in a client-server network environment, or as a peer machine in a peer-to-peer (or distributed) network environment.

The machine may be a server computer, a client computer, a virtual machine, a personal computer (PC), a tablet PC, a laptop computer, a set-top box (STB), a personal digital assistant (PDA), a cellular telephone, an iPhone, a Blackberry, a processor, a telephone, a web appliance, a network router, switch or bridge, or any machine capable of executing a set of instructions (sequential or otherwise) that specify actions to be taken by that machine.

While the machine-readable medium or machine-readable storage medium is shown in an exemplary embodiment to be a single medium, the terms "machine-readable medium" and "machine-readable storage medium" should be taken to include a single medium or multiple media (e.g., a centralized or distributed database, and/or associated caches and servers) that store the one or more sets of instructions. The terms "machine-readable medium" and "machine-readable storage medium" shall also be taken to include any medium that is capable of storing, encoding, or carrying a set of instructions for execution by the machine and that cause the machine to perform any one or more of the methodologies of the presently disclosed technique and innovation.

In general, the routines executed to implement the embodiments of the disclosure may be implemented as part of an operating system or a specific application, component, program, object, module, or sequence of instructions referred to as "computer programs." The computer programs typically comprise one or more instructions set at various times in various memory and storage devices in a computer, and when read and executed by one or more processing units or processors in a computer, cause the computer to perform operations to execute elements involving the various aspects of the disclosure.

Moreover, while embodiments have been described in the context of fully functioning computers and computer systems, those skilled in the art will appreciate that the various embodiments are capable of being distributed as a program product in a variety of forms, and that the disclosure applies equally regardless of the particular type of machine or computer-readable media used to actually effect the distribution.

Additional Use Cases

The systems disclosed herein may be used to augment care provided by healthcare providers. For example, one or more of the systems disclosed may be used to facilitate handoffs of patients to patient care providers. If the system, following an assessment, produces a score above a threshold for a particular mental state, the system may refer the patient to a specialist for further investigation and analysis. The patient may be referred before the assessment has been completed, for example, if the patient is receiving treatment in a telemedicine system or if the specialist is co-located with the patient. For example, the patient may be receiving treatment in a clinic with one or more specialists.
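
A minimal sketch of such threshold-based referral logic follows; the score scale, threshold, and routing strings are illustrative assumptions.

```python
# A minimal sketch of threshold-based referral after an assessment.
# The score scale, threshold, and routing are illustrative assumptions.
def route_patient(assessment: dict, threshold: float = 0.7) -> str:
    """Refer to a specialist when any target mental-state score exceeds
    the threshold; otherwise return the patient to standard care."""
    flagged = [state for state, score in assessment.items()
               if score > threshold]
    if flagged:
        return f"refer to specialist for: {', '.join(flagged)}"
    return "continue standard care"

print(route_patient({"depression": 0.82, "anxiety": 0.41}))
```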

The system disclosed may be able to direct clinical processes for patients following scoring. For example, if the patient takes the assessment using a client device, the patient may, following completion of the assessment, be referred to cognitive behavioral therapy (CBT) services. The patient may also be referred to health care providers, or have appointments with health care providers made by the system. The system disclosed may suggest one or more medications.

FIG. 59 shows an instantiation of a precision case management use case for the system. In a first step, the patient has a conversation with a case manager. In a second step, one or more entities passively record the conversation, with the consent of the patient. The conversation may be a face-to-face conversation. In another embodiment, the case manager may conduct the conversation remotely. For example, the conversation may be a conversation using a telemedicine platform. In a third step, real-time results are passed to a payer. The real-time results may include a score corresponding to a mental state. In a fourth step, the case manager may update a care plan based on the real-time results. For example, a particular score that exceeds a particular threshold may influence a future interaction between a care provider and a patient and may cause the provider to ask different questions of the patient. The score may even trigger the system to suggest particular questions associated with the score. The conversation may be repeated with the updated care plan.

FIG. 60 shows an instantiation of a primary care screening or monitoring use case for the system. In a first step, the patient visits with a primary care provider. In a second step, speech may be captured by the primary care provider's organization for e-transcription, and a copy may be provided to the system for analysis. In a third step, the primary care provider, from the analysis, may receive a real-time vital sign informing the care pathway. This may facilitate a warm handoff to a behavioral health specialist or may be used to direct a primary care provider along a specific care pathway.

FIG. 61 shows an example system for enhanced employee assistance plan (EAP) navigation and triage. In a first step, the patient may call the EAP line. In a second step, the system may record audiovisual data and screen the patient. The real-time screening or monitoring results may be delivered to the provider in real time. The provider may be able to adaptively screen the patient about high-risk topics, based on the collected real-time results. The real-time screening or monitoring data may also be provided to other entities. For example, the real-time screening or monitoring data may be provided to a clinician-on-call, used to schedule referrals, used for education purposes, or used for other purposes. The interaction between the patient and the EAP may be in person or may be remote. A person staffing an EAP line may be alerted in real time that a patient has a positive screen and may be able to help direct the patient to a proper level of therapy. An EAP staffer may also be directed to ask questions based on a result of an assessment administered to a patient, for example, a score corresponding to a patient's mental state.

Speech data as described herein may be collected and analyzed in real time, or it may be recorded and then analyzed later.

The system disclosed herein may be used to monitor interactions between unlicensed coaches and patients. The system may request consent from the patients before monitoring. The coaches may be used to administer questions. The coaches, in tandem with the assessment, may be able to provide an interaction with the patient that yields actionable predictions for clinicians and health care professionals, without being as costly as using the services of a clinician or health care professional. The assessment may be able to add rigor and robustness to judgments made by the unlicensed coaches. The assessment may also allow more people to take jobs as coaches, as it provides a method for validating coaches' methods.

While this invention has been described in terms of several embodiments, there are alterations, modifications, permutations, and substitute equivalents which fall within the scope of this invention. Although sub-section titles have been provided to aid in the description of the invention, these titles are merely illustrative and are not intended to limit the scope of the present invention. It should also be noted that there are many alternative ways of implementing the methods and apparatuses of the present invention. It is therefore intended that the following appended claims be interpreted as including all such alterations, modifications, permutations, and substitute equivalents as fall within the true spirit and scope of the present invention.

Terms and Definitions

Unless otherwise defined, all technical terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs.

As used herein, the singular forms "a," "an," and "the" include plural references unless the context clearly dictates otherwise. Any reference to "or" herein is intended to encompass "and/or" unless otherwise stated.

As used herein, the term "about" refers to an amount that is near the stated amount by 10%, 5%, or 1%, including increments therein.

As used herein, the term "about" in reference to a percentage refers to an amount that is greater or less than the stated percentage by 10%, 5%, or 1%, including increments therein.

As used herein, the phrases "at least one," "one or more," and "and/or" are open-ended expressions that are both conjunctive and disjunctive in operation. For example, each of the expressions "at least one of A, B and C," "at least one of A, B, or C," "one or more of A, B, and C," "one or more of A, B, or C," and "A, B, and/or C" means A alone, B alone, C alone, A and B together, A and C together, B and C together, or A, B, and C together.

Whenever the term "at least," "greater than," or "greater than or equal to" precedes the first numerical value in a series of two or more numerical values, the term "at least," "greater than," or "greater than or equal to" applies to each of the numerical values in that series of numerical values. For example, greater than or equal to 1, 2, or 3 is equivalent to greater than or equal to 1, greater than or equal to 2, or greater than or equal to 3.

Whenever the term "no more than," "less than," or "less than or equal to" precedes the first numerical value in a series of two or more numerical values, the term "no more than," "less than," or "less than or equal to" applies to each of the numerical values in that series of numerical values. For example, less than or equal to 3, 2, or 1 is equivalent to less than or equal to 3, less than or equal to 2, or less than or equal to 1.

Computer Systems

The present disclosure provides computer systems that are programmed to implement methods of the disclosure. FIG. 62 shows a computer system 6201 that is programmed or otherwise configured to assess a mental state of a subject in a single session or over multiple different sessions. The computer system 6201 can regulate various aspects of assessing a mental state of a subject in a single session or over multiple different sessions of the present disclosure, such as, for example, presenting queries, retrieving data, and processing data. The computer system 6201 can be an electronic device of a user or a computer system that is remotely located with respect to the electronic device. The electronic device can be a mobile electronic device.

The computer system 6201 includes a central processing unit (CPU, also "processor" and "computer processor" herein) 6205, which can be a single-core or multi-core processor, or a plurality of processors for parallel processing. The computer system 6201 also includes memory or memory location 6210 (e.g., random-access memory, read-only memory, flash memory), electronic storage unit 6215 (e.g., hard disk), communication interface 6220 (e.g., network adapter) for communicating with one or more other systems, and peripheral devices 6225, such as cache, other memory, data storage, and/or electronic display adapters. The memory 6210, storage unit 6215, interface 6220, and peripheral devices 6225 are in communication with the CPU 6205 through a communication bus (solid lines), such as a motherboard. The storage unit 6215 can be a data storage unit (or data repository) for storing data. The computer system 6201 can be operatively coupled to a computer network ("network") 6230 with the aid of the communication interface 6220. The network 6230 can be the Internet, an internet and/or extranet, or an intranet and/or extranet that is in communication with the Internet. The network 6230 in some cases is a telecommunication and/or data network. The network 6230 can include one or more computer servers, which can enable distributed computing, such as cloud computing. The network 6230, in some cases with the aid of the computer system 6201, can implement a peer-to-peer network, which may enable devices coupled to the computer system 6201 to behave as a client or a server.

The CPU 6205 can execute a sequence of machine-readable instructions, which can be embodied in a program or software. The instructions may be stored in a memory location, such as the memory 6210. The instructions can be directed to the CPU 6205, which can subsequently program or otherwise configure the CPU 6205 to implement methods of the present disclosure. Examples of operations performed by the CPU 6205 can include fetch, decode, execute, and writeback.

The CPU 6205 can be part of a circuit, such as an integrated circuit. One or more other components of the system 6201 can be included in the circuit. In some cases, the circuit is an application-specific integrated circuit (ASIC).

The storage unit 6215 can store files, such as drivers, libraries, and saved programs. The storage unit 6215 can store user data, e.g., user preferences and user programs. The computer system 6201 in some cases can include one or more additional data storage units that are external to the computer system 6201, such as located on a remote server that is in communication with the computer system 6201 through an intranet or the Internet.

The computer system 6201 can communicate with one or more remote computer systems through the network 6230. For instance, the computer system 6201 can communicate with a remote computer system of a user (e.g., the clinician). Examples of remote computer systems include personal computers (e.g., portable PC), slate or tablet PCs (e.g., Apple® iPad, Samsung® Galaxy Tab), telephones, smartphones (e.g., Apple® iPhone, Android-enabled device, Blackberry®), or personal digital assistants. The user can access the computer system 6201 via the network 6230.

Methods as described herein can be implemented by way of machine (e.g., computer processor) executable code stored on an electronic storage location of the computer system 6201, such as, for example, on the memory 6210 or electronic storage unit 6215. The machine-executable or machine-readable code can be provided in the form of software. During use, the code can be executed by the processor 6205. In some cases, the code can be retrieved from the storage unit 6215 and stored on the memory 6210 for ready access by the processor 6205. In some situations, the electronic storage unit 6215 can be precluded, and machine-executable instructions are stored on memory 6210.

The code can be pre-compiled and configured for use with a machine having a processor adapted to execute the code, or can be compiled during runtime. The code can be supplied in a programming language that can be selected to enable the code to execute in a pre-compiled or as-compiled fashion.

Aspects of the systems and methods provided herein, such as the computer system 6201, can be embodied in programming. Various aspects of the technology may be thought of as "products" or "articles of manufacture," typically in the form of machine (or processor) executable code and/or associated data that is carried on or embodied in a type of machine-readable medium. Machine-executable code can be stored on an electronic storage unit, such as memory (e.g., read-only memory, random-access memory, flash memory) or a hard disk. "Storage"-type media can include any or all of the tangible memory of the computers, processors, or the like, or associated modules thereof, such as various semiconductor memories, tape drives, disk drives, and the like, which may provide non-transitory storage at any time for the software programming. All or portions of the software may at times be communicated through the Internet or various other telecommunication networks. Such communications, for example, may enable loading of the software from one computer or processor into another, for example, from a management server or host computer into the computer platform of an application server. Thus, another type of media that may bear the software elements includes optical, electrical, and electromagnetic waves, such as used across physical interfaces between local devices, through wired and optical landline networks, and over various air-links. The physical elements that carry such waves, such as wired or wireless links, optical links, or the like, also may be considered as media bearing the software. As used herein, unless restricted to non-transitory, tangible "storage" media, terms such as computer or machine "readable medium" refer to any medium that participates in providing instructions to a processor for execution.

Hence, a machine-readable medium, such as computer-executable code, may take many forms, including but not limited to a tangible storage medium, a carrier-wave medium, or a physical transmission medium. Non-volatile storage media include, for example, optical or magnetic disks, such as any of the storage devices in any computer(s) or the like, such as may be used to implement the databases, etc. shown in the drawings. Volatile storage media include dynamic memory, such as the main memory of such a computer platform. Tangible transmission media include coaxial cables, copper wire, and fiber optics, including the wires that comprise a bus within a computer system. Carrier-wave transmission media may take the form of electric or electromagnetic signals, or acoustic or light waves such as those generated during radio frequency (RF) and infrared (IR) data communications. Common forms of computer-readable media therefore include, for example: a floppy disk, a flexible disk, a hard disk, magnetic tape, any other magnetic medium, a CD-ROM, DVD or DVD-ROM, any other optical medium, punch cards, paper tape, any other physical storage medium with patterns of holes, a RAM, a ROM, a PROM and EPROM, a FLASH-EPROM, any other memory chip or cartridge, a carrier wave transporting data or instructions, cables or links transporting such a carrier wave, or any other medium from which a computer may read programming code and/or data. Many of these forms of computer-readable media may be involved in carrying one or more sequences of one or more instructions to a processor for execution.

The computer system 6201 can include or be in communication with an electronic display 6235 that comprises a user interface (UI) 6240 for providing, for example, an assessment to a patient. Examples of UIs include, without limitation, a graphical user interface (GUI) and a web-based user interface.

Methods and systems of the present disclosure can be implemented by way of one or more algorithms. An algorithm can be implemented by way of software upon execution by the central processing unit 6205. The algorithm can, for example, analyze speech using natural language processing.

While preferred embodiments of the present invention have been shown and described herein, it will be obvious to those skilled in the art that such embodiments are provided by way of example only. It is not intended that the invention be limited by the specific examples provided within the specification. While the invention has been described with reference to the aforementioned specification, the descriptions and illustrations of the embodiments herein are not meant to be construed in a limiting sense. Numerous variations, changes, and substitutions will now occur to those skilled in the art without departing from the invention. Furthermore, it shall be understood that all aspects of the invention are not limited to the specific depictions, configurations, or relative proportions set forth herein, which depend upon a variety of conditions and variables. It should be understood that various alternatives to the embodiments of the invention described herein may be employed in practicing the invention. It is therefore contemplated that the invention shall also cover any such alternatives, modifications, variations, or equivalents. It is intended that the following claims define the scope of the invention and that methods and structures within the scope of these claims and their equivalents be covered thereby.

What is claimed is:
 1. A method for identifying whether a subject is at risk of having a mental or physiological condition, comprising: (a) obtaining data from said subject, said data comprising speech data and optionally associated visual data; (b) processing said data using one or more models comprising a natural language processing (NLP) model, an acoustic model, or a visual model, to yield processed data; (c) using said processed data to identify one or more features indicative of said mental or physiological condition; and (d) outputting an electronic report identifying whether said subject is at risk of having said mental or physiological condition, based at least on said one or more features that are identified using said processed data, which said risk is quantified in a form of a score having a confidence level provided in said electronic report.